ABSTRACT


A high-performance cache is disclosed. The cache is
designed for time and space efficiency across a diverse range
of information objects. Information objects are stored in
portions of a non-volatile storage device called arenas, 
which are contiguous regions from which space is allocated 
in parallel. Objects are substantially contiguously allocated 
within an arena and are mapped by name keys and content- 
based object keys to a tag table, an open directory, and a 
directory table. The tag table is indexed by the name keys, 
and stores references to sets in the directory table. The tag 
table is compact and therefore can be stored in fast main 
memory, facilitating rapid lookups. The directory table is 
organized so that at least a frequently-accessed portion of it 
also usually resides in fast main memory, which further 
speeds lookups. The tag and directory tables are organized 
to quickly determine non-presence of objects. Large objects 
are chunked into fragments, which are chained using a 
forward functional-iteration mechanism, to prevent the need 
for mutating existing on-disk data structures. Garbage col- 
lection periodically moves objects within an arena or to 
other arenas. Additionally, for a plurality of counters, the
following are computed: (1) the sum of values stored in the
counters, and (2) the maximum value that can be represented
by the counters. Each of the counters is decremented when
the sum is greater than half of the maximum value. Each of
the counters is associated with an information object, which
is deleted when its counter is decremented to zero.

16 Claims, 26 Drawing Sheets 
MAINTAINING COUNTERS FOR HIGH
PERFORMANCE OBJECT CACHE 

RELATED APPLICATIONS 

This application is a divisional application of U.S. application
Ser. No. 09/060,866, entitled "High Performance
Object Cache", filed by Peter Mattis, John Plevyak, Matthew
Haines, Adam Beguelin, Brian Totty, and David
Gourley on Apr. 15, 1998, now U.S. Pat. No. 6,128,623,
the contents of which are incorporated by reference.

FIELD OF THE INVENTION 

The present invention relates to information delivery, and
relates more specifically to a cache for information objects 
that are to be delivered efficiently and at high speed over a 
network to a client. 

BACKGROUND OF THE INVENTION 

Several important computer technologies rely, to a great 
extent, upon rapid delivery of information from a central 
storage location to remote devices. For example, in the 
client/server model of computing, one or more servers are 
used to store information. Client computers or processes are 
separated from the servers and are connected to the servers 
using a network. The clients request information from one of 
the servers by providing a network address of the informa- 
tion. The server locates the information based on the pro- 
vided network address and transmits it over the network to 
the client, completing the transaction. 

The World Wide Web is a popular application of the 
client/server computing model. FIG. 1 is a simplified block 
diagram of the relationship between elements used in a Web 
system. One or more web clients 10a, 10b, each of which is
a computer or a software process such as a browser program, 
are connected to a global information network 20 called the 
Internet, either directly or through an intermediary such as 
an Internet Service Provider, or an online information ser- 
vice. 

A web server 40 is likewise connected to the Internet 20 
by a network link 42. The web server 40 has one or more 
internet network addresses and textual host names, associ- 
ated in an agreed-upon format that is indexed at a central 
Domain Name Server (DNS). The server contains multime- 
dia information resources, such as documents and images, to 
be provided to clients upon demand. The server 40 may 
additionally or alternatively contain software for dynami- 
cally generating such resources in response to requests. 

The clients 10a, 10b and server 40 communicate using 
one or more agreed-upon protocols that specify the format of 
the information that is communicated. A client 10a looks up 
the network address of a particular server using DNS and
establishes a connection to the server using a communica- 
tion protocol called the Hypertext Transfer Protocol 
(HTTP). A Uniform Resource Locator (URL) uniquely 
identifies each information object stored on or dynamically 
generated by the server 40. A URL is a form of network 
address that identifies the location of information stored in 
a network. 

A key factor that limits the performance of the World 
Wide Web is the speed with which the server 40 can supply 
information to a client via the Internet 20. Performance is 
limited by the speed, reliability, and congestion level of the 
network route through the Internet, by geographical distance 
delays, and by server load level. Accordingly, client trans- 
action time can be reduced by storing replicas of popular 
information objects in repositories geographically dispersed
from the server. Each local repository for object replicas is
generally referred to as a cache. A client may be able to
access replicas from a topologically proximate cache faster
than possible from the original web server, while at the same
time reducing Internet server traffic.

In one arrangement, as shown in FIG. 1, the cache is
located in a proxy server 30 that is logically interposed
between the clients 10a, 10b and the server 40. The proxy
server provides a "middleman" gateway service, acting as a
server to the client, and a client to the server. A proxy server
equipped with a cache is called a caching proxy server, or
commonly, a "proxy cache".

The proxy cache 30 intercepts requests for resources that
are directed from the clients 10a, 10b to the server 40. When
the cache in the proxy 30 has a replica of the requested
resource that meets certain freshness constraints, the proxy
responds to the clients 10a, 10b and serves the resource
directly. In this arrangement, the number and volume of data
transfers along the link 42 are greatly reduced. As a result,
network resources or objects are provided more rapidly to
the clients 10a, 10b.

A key problem in such caching is the efficient storage, 
location, and retrieval of objects in the cache. This document
concerns technology related to the storage, location, and
retrieval of multimedia objects within a cache. The object 
storage facility within a cache is called a "cache object 
store" or "object store". 

To effectively handle heavy traffic environments, such as
the World Wide Web, a cache object store needs to be able
to handle tens or hundreds of millions of different objects,
while storing, deleting, and fetching the objects simultaneously.
Accordingly, cache performance must not degrade
significantly with object count. Performance is the driving
goal of cache object stores.

Finding an object in the cache is the most common
operation and therefore the cache must be extremely fast in
carrying out searches. The key factor that limits cache
performance is lookup time. It is desirable to have a cache
that can determine whether an object is in the cache (a "hit")
or not (a "miss") as fast as possible. In past approaches,
caches capable of storing millions of objects have been
stored in traditional file system storage structures. Traditional
file systems are poorly suited for multimedia object
caches because they are tuned for particular object sizes and
require multiple disk head movements to examine file system
metadata. Object stores can obtain higher lookup performance
by dedicating DRAM memory to the task of object
lookup, but because there are tens or hundreds of millions of
objects, the memory lookup tables must be very compact.

Once an object is located, it must be transferred to the
client efficiently. Modern disk drives offer high performance
when reading and writing sequential data, but suffer significant
performance delays when incurring disk head movements
to other parts of the disk. These disk head movements
are called "seeks". Disk performance is typically constrained
by the drive's rated seeks per second. To optimize
performance of a cache, it is desirable to minimize disk
seeks, by reading and writing contiguous blocks of data.

Eventually, the object store will become full, and particular
objects must be expunged to make room for new content.
This process is called "garbage collection". Garbage collection
must be efficient enough that it can run continually
without causing a significant decrease in system
performance, while removing objects that have the least
impact on future cache performance.
Past Approaches

In the past, four approaches have been used to structure 
cache object stores: using the native file system, using a 
memory-blocked "page" cache, using a database, and using
a "cyclone" circular storage structure. Each of these prior 
approaches has significant disadvantages. 

The native file system approach uses the file system of an 
operating system running on the server to create and manage 
a cache. File systems are designed with a particular application
in mind: storing and retrieving user and system data
files. File systems are designed and optimized for file 
management applications. They are optimized for typical 
data file sizes and for a relatively small number of files (both 
total and within one folder/directory). Traditional file sys- 
tems are not optimized to minimize the number of seeks to 
open, read/write, and close files. Many file systems incur 
significant performance penalties to locate and open files 
when there are large numbers of files present. Typical file 
systems suffer fragmentation, with small disk blocks scat- 
tered around the drive surface, increasing the number of disk 
seeks required to access data, and wasting storage space. 
Also, file systems, being designed for user data file 
management, include facilities irrelevant to cache object 
stores, and indeed counter-productive to this application. 
Examples include: support for random access and selective 
modification, file permissions, support for moving files, 
support for renaming files, and support for appending to files 
over time. File systems also invest significant effort to
minimize any data loss, at the expense of performance, both
at write time, and to reconstruct the file system after failure.
The result is that file systems are relatively poorly suited for
handling the millions of files that can be present in a cache 
of Web objects. File systems don't efficiently support the 
large variation in Internet multimedia object size — in par- 
ticular they typically do not support very small objects or 
very large objects efficiently. File systems require a large 
number of disk seeks for metadata traversal and block 
chaining, poorly support garbage collection, and take time to 
ensure data integrity and to repair file systems on restart. 

The page cache extends file systems with a set of fixed 
sized memory buffers. Data is staged in and out of these 
buffers before transmission across the network. This 
approach wastes significant memory for large objects being 
sent across slow connections. 

The database system approach uses a database system as 
a cache. Generally, databases are structured to achieve goals 
that make them inappropriate for use as an object cache. For 
example, they are structured to optimize transaction pro- 
cessing. To preserve the integrity of each transaction, they 
use extensive locking. As a result, as a design goal they favor 
data integrity over performance factors such as speed. In 
contrast, it is acceptable for an object cache to lose data 
occasionally, provided that the cache does not corrupt 
objects, because the data always can be retrieved from the 
server that is the original source of the data. Databases are often
optimized for fast write performance, since write speed 
limits transaction processing speed. However, in an object 
cache, read speed is equally important. Further, databases 
are not naturally good at storing a vast variety of object sizes 
while supporting streaming, pipelined I/O in a virtual 
memory efficient manner. Databases are commonly optimized
for fixed record sizes. Where databases support variable
record sizes, they contain support for maintaining object 
relationships that are redundant, and typically employ slow, 
virtual memory paging techniques to support streaming, 
pipelined I/O. 
In a cyclonic file system, data is allocated around a
circular storage structure. When space becomes full, the
oldest data is simply removed. This approach allows for fast
allocation of data, but makes it difficult to support large
objects without first staging them in memory, suffers problems
with fragmentation of data, and typically entails naive
garbage collection that throws out the oldest object, regardless
of its popularity. For a modest, active cache with a
diverse working set, such first-in-first-out garbage collection
can throw objects out before they get to be reused.

The fundamental problem with the above approaches for
the design of cache object stores is that the solution isn't
optimized for the constraints of the problem. These
approaches all represent reapplication of existing technologies
to a new application. None of the applications above are
ideally suited for the unique constraints of multimedia,
streaming, object caches. Not only do the above solutions
inherently encumber object caches with inefficiencies due to
their imperfect reapplication, but they also are unable to
effectively support the more unique requirements of multimedia
object caches. These unique requirements include the
ability to disambiguate and share redundant content that is
identical, but has different names, and the opposite ability to
store multiple variants of content with the same name,
targeted for particular clients, languages, data types, etc.

Based on the foregoing, there is a clear need to provide an
object cache that overcomes the disadvantages of these prior
approaches, and is more ideally suited for the unique
requirements of multimedia object caches. In particular:

1. there is a need for an object store that can store
hundreds of millions of objects of disparate sizes, and
a terabyte of content size in a memory efficient manner;

2. there is a need for an object store that can determine if
a document is a "hit" or a "miss" quickly, without
time-consuming file directory lookups;

3. there is a need for a cache that minimizes the number
of disk seeks to read and write objects;

4. there is a need for an object store that permits efficient
streaming of data to and from the cache;

5. there is a need for an object store that supports multiple
different versions of targeted alternates for the same
name;

6. there is a need for an object store that efficiently stores
large numbers of objects without content duplication;

7. there is a need for an object store that can be rapidly and
efficiently garbage collected in real-time, insightfully
selecting the documents to be replaced to improve user
response speed, and traffic reduction;

8. there is a need for an object store that can restart
to full operational capacity within seconds after software
or hardware failure without data corruption and
with minimal data loss.

This document concerns technology directed to accomplishing
the foregoing goals. In particular, this document
describes methods and structures related to the time-efficient
and space-efficient storage, retrieval, and maintenance of
objects in a large object store. The technology described
herein provides for a cache object store for a high-performance,
high-load application having the following
general characteristics:

1. High performance, measured in low latency and high
throughput for object store operations, and large numbers
of concurrent operations;

2. Large cache support, supporting terabyte caches and
billions of objects, to handle the Internet's exponential
content growth rate;

3. Memory storage space efficiency, so expensive semiconductor
memory is used sparingly and effectively;

4. Disk storage space efficiency, so large numbers of
Internet object replicas can be stored within the finite
disk capacity of the object store;

5. Alias free, so that multiple objects or object variants,
with different names, but with identical object content,
will have the object content cached
only once, shared among the different names;

6. Support for multimedia heterogeneity, efficiently supporting
diverse multimedia objects of a multitude of
types with size ranging over six orders of magnitude
from a few hundred bytes to hundreds of megabytes;

7. Fast, usage-aware garbage collection, so less useful
objects can be efficiently removed from the object store
to make room for new objects;

8. Data consistency, so programmatic errors and hardware
failures do not lead to corrupted data;

9. Fast restartability, so an object cache can begin servicing
requests within seconds of restart, without requiring
a time-consuming database or file system check operation;

10. Streaming, so large objects can be efficiently pipelined
from the object store to slow clients, without staging
the entire object into memory;

11. Support for content negotiation, so proxy caches can
efficiently and flexibly store variants of objects for the
same URL, targeted on client browser, language, or
other attribute of the client request; and

12. General-purpose applicability, so that the object store
interface is sufficiently flexible to meet the needs of
future media types and protocols.

SUMMARY OF THE INVENTION

Described herein is a mechanism for maintaining counters
stored in a computer system. According to an aspect of the
present invention, for a plurality of counters the following are
computed: (1) the sum of values stored in the counters, and
(2) the maximum value that can be represented by the
counters. Each of the counters is decremented when the
sum is greater than half of the maximum value. According
to another aspect of the present invention, each of the
counters is associated with an information object. When a
counter is decremented to zero, the associated information
object is deleted.
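
The counter maintenance summarized above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the class name `ScaledCounters`, the default 8-bit counter width, the saturating `touch` method, and the reading of "maximum value" as the combined capacity of all counters (number of counters times the per-counter maximum) are all hypothetical choices made for the example.

```python
class ScaledCounters:
    """Per-object counters that are aged when their total grows too large."""

    def __init__(self, bits=8):
        self.per_counter_max = (1 << bits) - 1  # assumed counter width
        self.counters = {}                      # object id -> counter value
        self.deleted = []                       # ids whose counter reached zero

    def touch(self, obj_id):
        """Record one access to obj_id (saturating), then age if needed."""
        value = self.counters.get(obj_id, 0)
        self.counters[obj_id] = min(value + 1, self.per_counter_max)
        self.age_if_needed()

    def age_if_needed(self):
        """Decrement every counter when the sum exceeds half the maximum."""
        total = sum(self.counters.values())                   # (1) sum of values
        capacity = self.per_counter_max * len(self.counters)  # (2) assumed maximum
        if total * 2 <= capacity:
            return False
        for obj_id in list(self.counters):
            self.counters[obj_id] -= 1
            if self.counters[obj_id] == 0:
                # The associated information object would be deleted here.
                del self.counters[obj_id]
                self.deleted.append(obj_id)
        return True
```

With 2-bit counters holding the values 3, 1, and 3, the sum (7) exceeds half of the assumed maximum (9), so one aging pass leaves 2 and 2 and marks the object whose counter reached zero for deletion.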

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention is illustrated by way of example,
and not by way of limitation, in the figures of the accompanying
drawings and in which like reference numerals refer
to similar elements and in which:

FIG. 1 is a block diagram of a client/server relationship; 

FIG. 2 is a block diagram of a traffic server;

FIG. 3A is a block diagram of transformation of an object
into a key;

FIG. 3B is a block diagram of transformation of an object
name into a key;

FIG. 4A is a block diagram of a cache;

FIG. 4B is a block diagram of a storage mechanism for
Vectors of Alternates;

FIG. 4C is a block diagram of a multi-segment directory
table;

FIG. 5 is a block diagram of pointers relating to data 
fragments; 
FIG. 6 is a block diagram of a storage device and its
contents;

FIG. 7 is a block diagram showing the structure of a pool;

FIG. 8A is a flow diagram of a process of garbage
collection;

FIG. 8B is a flow diagram of a process of writing
information in a storage device;

FIG. 8C is a flow diagram of a process of synchronization; 

FIG. 8D is a flow diagram of a "checkout_read" process; 

FIG. 8E is a flow diagram of a "checkout_write" process; 

FIG. 8F is a flow diagram of a "checkout_Create" 
process; 

FIG. 9A is a flow diagram of a cache lookup process;

FIG. 9B is a flow diagram of a "checkin" process; 

FIG. 9C is a flow diagram of a cache lookup process; 

FIG. 9D is a flow diagram of a cache remove process; 

FIG. 9E is a flow diagram of a cache read process; 

FIG. 9F is a flow diagram of a cache write process; 

FIG. 9G is a flow diagram of a cache update process; 

FIG. 10A is a flow diagram of a process of allocating and
writing objects in a storage device; 

FIG. 10B is a flow diagram of a process of scaled counter 
updating; 

FIG. 11 is a block diagram of a computer system that can 
be used to implement the present invention; 

FIG. 12 is a flow diagram of a process of object 
re-validation. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 

A method and apparatus for caching information objects 
is described. In the following description, for the purposes of 
explanation, numerous specific details are set forth in order 
to provide a thorough understanding of the present inven- 
tion. It will be apparent, however, to one skilled in the art 
that the present invention may be practiced without these 
specific details. In other instances, well-known structures 
and devices are shown in block diagram form in order to 
avoid unnecessarily obscuring the present invention. 

Traffic Server 

FIG. 2 is a block diagram of the general structure of 
certain elements of a proxy 30. In one embodiment, the 
proxy 30 is called a traffic server and comprises one or more 
computer programs or processes that operate on a computer 
workstation of the type described further below. A client 10a 
directs a request 50 for an object to the proxy 30 via the 
Internet 20. In this context, the term "object" means a 
network resource or any discrete element of information that 
is delivered from a server. Examples of objects include Web 
pages or documents, graphic images, files, text documents, 
and objects created by Web application programs during 
execution of the programs, or other elements stored on a 
server that is accessible through the Internet 20. 
Alternatively, the client 10a is connected to the proxy 30 
through a network other than the Internet. 

The incoming request 50 arrives at an input/output (I/O) 
core 60 of the proxy 30. The I/O core 60 functions to adjust 
the rate of data received or delivered by the proxy to match 
the data transmission speed of the link between the client 
10a and the Internet 20. In a preferred embodiment, the I/O 
core 60 is implemented in the form of a circularly arranged 
set of buckets that are disposed between input buffers and 
output buffers that are coupled to the proxy 30 and the
Internet 20. Connections among the proxy 30 and one or
more clients 10a are stored in the buckets. Each bucket in the
set is successively examined, and each connection in the
bucket is polled. During polling, the amount of information
that has accumulated in a buffer associated with the connection
since the last poll is determined. Based on the
amount, a period value associated with the connection is
adjusted. The connection is then stored in a different bucket
that is generally identified by the sum of the current bucket
number and the period value. Polling continues with the next
connection and the next bucket. In this way, the elapsed time
between successive polls of a connection automatically
adjusts to the actual operating bandwidth or data communication
speed of the connection.
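
The bucket-based polling of the I/O core can be illustrated with the following Python sketch. It is an illustration only: the names `Connection` and `PollingWheel`, the bucket count, the maximum period, and in particular the adjustment policy (halve the period when data has accumulated, double it otherwise) are assumptions, since the description states only that the period is adjusted based on the accumulated amount.

```python
class Connection:
    def __init__(self, name, period=1):
        self.name = name
        self.period = period  # number of buckets to skip before the next poll
        self.pending = 0      # bytes accumulated since the last poll


class PollingWheel:
    """Circularly arranged buckets of connections, examined one bucket at a time."""

    def __init__(self, num_buckets=8, max_period=4):
        self.buckets = [[] for _ in range(num_buckets)]
        self.max_period = max_period
        self.current = 0

    def add(self, conn):
        self.buckets[self.current].append(conn)

    def _adjust(self, conn):
        # Assumed policy: poll busy connections sooner and idle ones later.
        if conn.pending > 0:
            conn.period = max(1, conn.period // 2)
        else:
            conn.period = min(self.max_period, conn.period * 2)
        conn.pending = 0

    def step(self):
        """Poll each connection in the current bucket, then advance one bucket."""
        n = len(self.buckets)
        polled, self.buckets[self.current] = self.buckets[self.current], []
        for conn in polled:
            self._adjust(conn)
            # New bucket = current bucket number + period value (wrapping around).
            self.buckets[(self.current + conn.period) % n].append(conn)
        self.current = (self.current + 1) % n
        return [conn.name for conn in polled]
```

Under this assumed policy, an idle connection drifts toward longer intervals between polls, while a connection that keeps accumulating data is moved to nearer buckets and polled more often.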

The I/O core 60 passes the request 50 to a protocol engine 
70 that is coupled to the I/O core 60 and to a cache 80. The 
protocol engine 70 functions to parse the request 50 and 
determine what type of substantive action is embodied in the 
request 50. Based on information in the request 50, the
protocol engine 70 provides a command to the cache 80 to 
carry out a particular operation. In an embodiment, the cache 
80 is implemented in one or more computer programs that 
are accessible to the protocol engine 70 using an application 
programming interface (API). In this embodiment, the protocol
engine decodes the request 50 and performs a function
call to the API of the cache 80. The function call includes, 
as parameter values, information derived from the request 
50. 

The cache 80 is coupled to send and receive information
to and from the protocol engine 70 and to interact with one
or more non-volatile mass storage devices 90a-90n. In an
embodiment, the storage devices 90a-90n are high-capacity,
fast disk drives. The cache 80 also interacts with data tables
82 that are described in more detail herein.

Object Cache Indexing Content Indexing 

In the preferred embodiment, the cache 80 stores objects
on the storage devices 90a-90n. Popular objects are also
replicated into a cache. In the preferred embodiment, the
cache has finite size, and is stored in main memory or RAM 
of the proxy 30. 

Objects on disk are indexed by fixed sized locators, called
keys. Keys are used to index into directories that point to the
location of objects on disk, and to metadata about the
objects. There are two types of keys, called "name keys" and
"object keys". Name keys are used to index metadata about
a named object, and object keys are used to index true object
content. Name keys are used to convert URLs and other
information resource names into a metadata structure that
contains object keys for the object data. As will be discussed
subsequently, this two-level indexing structure facilitates the
ability to associate multiple alternate objects with a single
name, while at the same time maintaining a single copy of
any object content on disk, shared between multiple different
names or alternates.

Unlike other cache systems that use the name or URL of 
an object as the key by which the object is referenced, 
embodiments of the invention use a "fingerprint" of the
content that makes up the object itself, to locate the object.
Keys generated from the content of the indexed object are
referred to herein as object keys. Specifically, the object key
56 is a unique fingerprint or compressed representation of
the contents of the object 52. Preferably, a copy of the object
52 is provided as input to a hash function 54, and its output
is the object key 56. For example, a file or other representation
of the object 52 is provided as input to the hash
function, which reads each byte of the file and generates a 
portion of the object key 56, until the entire file has been 
read. In this way, an object key 56 is generated based upon 
the entire contents of the object 52 rather than its name. 
Since the keys are content-based, and serve as indexes into 
tables of the cache 80, the cache is referred to as a content- 
indexed cache. Given a content fingerprint key, the content 
can easily be found. 
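
The generation of a content-based object key can be sketched as follows. This is a minimal Python illustration that assumes MD5 as the hash function (one suitable choice named in this description); the function name `object_key` and the chunk size are hypothetical.

```python
import hashlib
import io


def object_key(stream, chunk_size=64 * 1024):
    """Return a 128-bit fingerprint computed over the entire content of a stream."""
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)  # consume the object piece by piece
        if not chunk:
            break
        digest.update(chunk)             # each piece contributes to the key
    return digest.digest()               # 16 bytes = 128 bits


# Usage: the key depends only on the bytes read, not on any name.
key = object_key(io.BytesIO(b"example object content"))
```

Because the hash is fed incrementally, the whole object never needs to be held in memory to compute its key.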

In this embodiment, content indexing enables the cache 
80 to detect duplicate objects that have different names but 
the same content. Such duplicates will be detected because 
objects having identical content will hash to the same key 
value even if the objects have different names. 

For example, assume that the server 40 is storing, in one 
subdirectory, a software program comprising an executable 
file that is 10 megabytes in size, named "IE4.exe". Assume 
further that the server 40 is storing, in a different 
subdirectory, a copy of the same file, named "Internet 
Explorer.exe". The server 40 is an anonymous FTP server
that can deliver copies of the files over an HTTP connection 
using the FTP protocol. In past approaches, when one or 
more clients request the two files, the cache stores a copy of 
each of the files in cache storage, and indexes each of the 
files under its name in the cache. As a result, the cache must 
use 20 megabytes of storage for two objects that are identical 
except for the name. 

In embodiments of the invention, as discussed in more 
detail herein, for each of the objects, the cache creates a 
name key and an object key. The name keys are created by 
applying a hash function to the name of the object. The 
object keys are created by applying a hash function to the 
content of the object. As a result, for the two exemplary 
objects described above, two different name keys are 
created, but the object key is the same. When the first object 
is stored in the cache, its name key and object key are stored 
in the cache. When the second object is stored in the cache 
thereafter, its name key is stored in the cache. However, the 
cache detects the prior identical object key entry, and does 
not store a duplicate object key entry; instead, the cache 
stores a reference to the same object key entry in association 
with the name key, and deletes the new, redundant object. As 
a result, only 10 megabytes of object storage is required. 
Thus, the cache detects duplicate objects that have different 
names, and stores only one permanent copy of each such 
object. 
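
The duplicate detection described above can be sketched in Python as follows. The sketch is illustrative only: the class `ContentIndexedCache` and its methods are hypothetical, in-memory dictionaries stand in for the on-disk directory structures, and MD5 is assumed as the hash function for both name keys and object keys.

```python
import hashlib


class ContentIndexedCache:
    def __init__(self):
        self.names = {}    # name key -> object key
        self.objects = {}  # object key -> content bytes (kept once)

    @staticmethod
    def _key(data):
        return hashlib.md5(data).digest()

    def store(self, name, content):
        """Store content under name; return True only if new content was kept."""
        name_key = self._key(name.encode("utf-8"))
        object_key = self._key(content)
        is_new = object_key not in self.objects
        if is_new:
            self.objects[object_key] = content
        # A duplicate only adds a reference from its name key to the
        # existing object key entry; the redundant content is dropped.
        self.names[name_key] = object_key
        return is_new

    def fetch(self, name):
        object_key = self.names.get(self._key(name.encode("utf-8")))
        return None if object_key is None else self.objects[object_key]

    def stored_bytes(self):
        return sum(len(content) for content in self.objects.values())


cache = ContentIndexedCache()
payload = b"x" * 1000                         # stand-in for the 10 megabyte file
cache.store("IE4.exe", payload)
cache.store("Internet Explorer.exe", payload)
```

After both calls, two name keys exist but the content is held only once, so `cache.stored_bytes()` reports the size of a single copy.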

FIG. 3A is a block diagram of mechanisms used to 
generate an object key 56 for an object 52. When client 10a 
requests an object 52, and the object is not found in the cache 
80 using the processes described herein, the cache retrieves 
the object from a server and generates an object key 56 for
storing the object in the cache. 

Directories are the data structures that map keys to 
locations on disk. It is advisable to keep all or most of the 
contents of the directories in memory to provide for fast 
lookups. This requires directory entries to be small, permit- 
ting a large number of entries in a feasible amount of 
memory. Further, because 50% of the accesses are expected 
not to be stored in cache, we want to determine cache misses 
quickly, without expending precious disk seeks. Such fast 
miss optimizations dedicate scarce disk head movements to 
real data transfers, not unsuccessful speculative lookups. 
Finally, to make lookups fast via hashing search techniques, 
directory entries are fixed size. 

Keys are carefully structured to be fixed size and small, 
for the reasons described earlier. Furthermore, keys are 
partitioned into subkeys for the purposes of storage
efficiency and fast lookups. Misses can be identified quickly by
detecting differences in just a small portion of keys. For this
reason, instead of searching a full directory table containing
complete keys, misses are filtered quickly using a table of
small subkeys called a "tag table". Furthermore, statistical
properties of large bit vectors can be exploited to create
space-efficient keys that support large numbers of cache
objects with small space requirements.

According to one embodiment, the object key 56 comprises
a set subkey 58 and a tag subkey 59. The set subkey
58 and tag subkey 59 comprise a subset of the bits that make
up the complete object key 56. For example, when the
complete object key 56 is 128 bits in length, the subkeys 58,
59 can be 16 bits, 27 bits, or any other portion of the
complete key. The subkeys 58, 59 are used in certain
operations, which are described below, in which the subkeys
yield results that are nearly as accurate as when the complete
key is used. In this context, "accurate" means that use of the
subkeys causes a hit in the cache to the correct object as
often as when the complete key is used.

This accuracy property is known as "smoothness" and is
a characteristic of a certain preferred subset of hash
functions. An example of a hash function suitable for use in an
embodiment is the MD5 hash function, which is described
in detail in B. Schneier, "Applied Cryptography" (New
York: John Wiley & Sons, Inc., 2d ed. 1996), at pp. 429-431
and pp. 436-441. The MD5 hash function generates a 128-bit
key from an input data stream having an arbitrary length.
Generally the MD5 hash function and other one-way hash
functions are used in the cryptography field to generate
secure keys for messages or documents that are to be
transmitted over secure channels. General hashing table
construction and search techniques are described in detail in
D. Knuth, "The Art of Computer Programming: Vol. 3,
Sorting and Searching," at 506-549 (Reading, Mass.:
Addison-Wesley, 1973).

Name Indexing

Unfortunately, requests for objects typically do not
identify requested objects using the object keys for the objects.
Rather, requests typically identify requested objects by
name. The format of the name may vary from implementation
to implementation based on the environment in which
the cache is used. For example, the object name may be a file
system name, a network address, or a URL.

According to one aspect of the invention, the object key
for a requested object is indexed under a "name key" that is
generated based on the object name. Thus, retrieval of an
object in response to a request is a two phase process, where
a name key is used to locate the object key, and the object
key is used to locate the object itself.

FIG. 3B is a block diagram of mechanisms used to
generate a name key 62 based on an object name 53.
According to one embodiment, the same hash function 54
that is used to generate object keys is used to generate name
keys. Thus, the name keys will have the same length and
smoothness characteristics of the object keys.

Similar to object key 56, the name key 62 comprises set
and tag subkeys 64, 66. The subkeys 64, 66 comprise a
subset of the bits that make up the complete name key 62.
For example, when the complete name key 62 is 128 bits in
length, the first and second subkeys 64, 66 can be 16 bits, 27
bits, or any other portion of the complete key.

Searching by Objects or Name Key

Preferably, the cache 80 comprises certain data structures
that are stored in the memory of a computer system or in its
non-volatile storage devices, such as disks. FIG. 4 is a block
diagram of the general structure of the cache 80. The cache
80 generally comprises a Tag Table 102, a Directory Table
110, an Open Directory table 130, and a set of pools 200a
through 200n, coupled together using logical references as
described further below.

The Tag Table 102 and the Directory Table 110 are
organized as set associative hash tables. The Tag Table 102,
the Directory Table 110, and the Open Directory table 130
correspond to the tables 82 shown in FIG. 2. For the
purposes of explanation it shall be assumed that an index
search is being performed based on object key 56. However,
the Tag Table 102 and Directory Table 110 operate in the
same fashion when traversed based on a name key 62.

The Tag Table 102 is a set-associative array of sets 104a,
104b, through 104n. The tag table is designed to be small
enough to fit in main memory. Its purpose is to quickly
detect misses, whereby using only a small subset of the bits
in the key a determination can be made that the key is not
stored in the cache. The designation 104n is used to indicate
that no particular number of sets is required in the Tag Table
102. As shown in the case of set 104n, each of the sets
104a-104n comprises a plurality of blocks 106.

In the preferred embodiment, the object key 56 is 128 bits
in length. The set subkey 58 is used to identify and select one
of the sets 104a-104n. Preferably, the set subkey 58 is
approximately 18 bits in length. The tag subkey 59 is used
to reference one of the entries 106 within a selected set.
Preferably, the tag subkey 59 is approximately 16 bits in
length, but may be as small as zero bits in cases in which
there are many sets. In such cases, the tag table would be a
bit vector.

The mechanism used to identify or refer to an element
may vary from implementation to implementation, and may
include references, pointers, or a combination
thereof. In this context, the term "reference" indicates that
one element identifies or refers to another element. A
remainder subkey 56' consists of the remaining bits of the
key 56. The set subkey, tag subkey and remainder subkey
are sometimes abbreviated s, t, and r, respectively.

The preferred structure of the Tag Table 102, in which
each entry contains a relatively small amount of information,
enables the Tag Table to be stored in fast, volatile main
memory such as RAM. Thus, the structure of the Tag Table
102 facilitates rapid operation of the cache. The blocks in the
Directory Table 110, on the other hand, include much more
information as described below, and consequently, portions
of the Directory Table may reside on magnetic disk media as
opposed to fast DRAM memory at any given time.

The Directory Table 110 comprises a plurality of sets
110a-110n. Each of the sets 110a-110n has a fixed size, and
each comprises a plurality of blocks 112a-112n. In the
preferred embodiment, there is a predetermined, constant
number of sets and a predetermined, constant number of
blocks in each set. As shown in the case of block 112n, each
of the blocks 112a-112n stores a third, remainder subkey
value 116, a disk location value 118, and a size value 120.
In the preferred embodiment the remainder subkey value
116 is a 27-bit portion of the 128-bit complete object key 56,
and comprises bits of the complete object key 56 that are
disjoint from the bits that comprise the set or tag subkeys 58,
59.

In a search, the subkey values stored in the entry 106 of
the Tag Table 102 match or reference one of the sets
110a-110n, as indicated by the arrow in FIG. 4 that connects
the entry 106 to the set 110d. As an example, consider the
12-bit key and four-bit first and second subkeys described
above. Assume that the set subkey value 1111 matches set
104n of the Tag Table 102, and the tag subkey value 0000
matches entry 106 of set 104n. The match of the tag subkey
value 0000 indicates that there is a corresponding entry in set
110d of the Directory Table 110 associated with the key
prefix 11110000. When one of the sets 110a-110n is selected
in this manner, the blocks within the selected set are 
searched linearly to find a block, such as block 112a, that 
contains the remainder subkey value 116 that matches a 
corresponding portion of the object key 56. If a match is 
found, then there is almost always a hit in the cache. There 
is a small possibility of a miss if the first, second and third 
subkeys don't comprise the entire key. If there is a hit, the 
referenced object is then located based on information 
contained in the block, retrieved from one of the cache 
storage devices 90a-90n, and provided to the client 10a, as 
described further below. 
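
The Directory Table stage of this search can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment: the class name `Block`, the function `find_block`, the 4-bit remainder width, and the choice of the low-order bits for the remainder are all assumptions borrowed from the 12-bit example. The three fields mirror the remainder subkey value 116, the disk location value 118, and the size value 120.

```python
from dataclasses import dataclass
from typing import List, Optional

REMAINDER_BITS = 4  # assumption: toy width taken from the 12-bit example


@dataclass
class Block:
    remainder: int      # remainder subkey value 116
    disk_location: int  # disk location value 118
    size: int           # size value 120


def find_block(directory_set: List[Block], object_key: int) -> Optional[Block]:
    """Linearly search one selected set for a matching remainder subkey."""
    remainder = object_key & ((1 << REMAINDER_BITS) - 1)  # low-order bits (assumed)
    for block in directory_set:
        if block.remainder == remainder:
            return block  # almost always a true hit
    return None  # miss in the Directory Table


# Hypothetical contents of the set reached through key prefix 11110000.
selected_set = [Block(0b0011, 8192, 700), Block(0b1010, 4096, 512)]
hit = find_block(selected_set, 0b111100001010)
```

In this sketch `hit` refers to the block whose remainder is 1010, and its disk location and size would then drive the read from a storage device.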

Unlike the Tag Table, whose job is to quickly rule out
misses with minimal use of RAM, each
block within Directory Table 110 includes a full pointer to
a disk location. The item referenced by the disk location 
value 118 varies depending on the source from which the 
key was produced. If the key was produced based on the 
content of an object, as described above, then the disk
location value 118 indicates the location of a stored object
124 (or a first fragment thereof), as shown in FIG. 4 in the
case of block 112b. If the key is a name key, then as shown
for block 112n, the disk location value 118 indicates the
location of one or more Vectors of Alternates 122, each of
which stores one or more object keys for the object whose 
name was used to generate the name key. A single Tag Table 
102 and a single Directory Table 110 are shown in FIG. 4 
merely by way of example. However, additional tables that 
provide additional levels of storage and indexing may be 
employed in alternate embodiments. 

In the preferred arrangement, when a search of the cache 
is conducted, a hit or miss will occur in the Tag Table 102 
very quickly. If there is a hit in the Tag Table 102, then there 
is a very high probability that a corresponding entry will 
exist in the Directory Table 110. The high probability results 
from the fact that a hit in the Tag Table 102 means that the 
cache holds an object whose full key shares X identical bits
with the received key, where X is the number of bits in the
concatenation of the set and tag subkeys 58 and 59. The
cache 80 therefore operates rapidly and efficiently: hits and
misses are detected quickly using the Tag Table 102 in
memory, without requiring the entire Directory Table 110 to
reside in main memory.

When the cache is searched based on object key 56, the 
set subkey 58 is used to index one of the sets 104a-104n in
Tag Table 102. Once the set associated with subkey 58 is 
identified, a linear search is performed through the elements 
in the set to identify an entry whose tag matches the tag 
subkey 59. 

In a search for an object 52 requested from the cache 80 
by a client 10a, when one of the sets 104a-104n is selected 
using the set subkey 58, a linear search of all the elements 
106 in that set is carried out. The search seeks a match of the 
tag subkey 59 to one of the entries. If a match is found, then
there is a hit in the Tag Table 102 for the requested object, 
and the cache 80 proceeds to seek a hit in the Directory Table 
110. 

For purposes of example, assume that the object key is a 
12-bit key having a value of 111100001010, the set subkey 
comprises the first four bits of the object key having a value 
of 1111, and the tag subkey comprises the next four bits of
the object key having a value of 0000. In production use the
number of remainder bits would be significantly larger than
the set and tag bits to effect memory savings. The cache
identifies set 15 (1111) as the set to examine in the Tag Table
102. The cache searches for an entry within that set that
contains a tag 0000. If there is no such entry, then a miss 
occurs in the Tag Table 102. If there is such an entry, then 
the cache proceeds to check the remaining bits in Directory 
Table 110 for a match. 
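
The example above can be worked through in a few lines of Python. This is only a sketch under stated assumptions: the helper names `split_key` and `tag_table_hit` are hypothetical, the set subkey is assumed to occupy the highest-order bits, and the Tag Table is modeled as a dictionary from set number to a list of tags.

```python
from typing import Dict, List, Tuple


def split_key(key: int, set_bits: int, tag_bits: int, rem_bits: int) -> Tuple[int, int, int]:
    """Return the (set, tag, remainder) subkeys of a key, high-order bits first."""
    s = (key >> (tag_bits + rem_bits)) & ((1 << set_bits) - 1)
    t = (key >> rem_bits) & ((1 << tag_bits) - 1)
    r = key & ((1 << rem_bits) - 1)
    return s, t, r


def tag_table_hit(tag_table: Dict[int, List[int]], s: int, t: int) -> bool:
    """Linear search of set s for tag t; False means a fast miss."""
    return any(entry == t for entry in tag_table.get(s, []))


# The 12-bit key 111100001010 with four-bit set and tag subkeys.
s, t, r = split_key(0b111100001010, 4, 4, 4)

# Hypothetical contents of set 15 of the Tag Table.
tag_table = {0b1111: [0b0110, 0b0000]}
found = tag_table_hit(tag_table, s, t)
```

Here `s` is 15, `t` is 0, and `r` is 10 (binary 1010). Because set 15 holds tag 0000, the search would proceed to the Directory Table; had the tag been absent, the miss would have been decided without touching the disk.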

Multi-Level Directory Table 

In one embodiment, the Directory Table 110 contains 
multiple sets each composed of a fixed number of elements. 
Each element contains the remainder tag and a disk pointer. 
Large caches will contain large numbers of objects, which 
will require large numbers of elements in the directory table. 
This can create tables too large to be cost-effectively stored
in main memory. 

For example, if a cache were configured with 128 million
directory table elements, and each element were represented
by a modest 8 bytes of storage, 1 GByte of memory would
be required to store the directory table, which is more
memory than is common on contemporary workstation
computers. Because few of these objects will be actively
accessed at any time, there is a desire to migrate the 
underutilized entries onto disk while leaving higher utilized 
entries in main memory. 
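
The arithmetic behind the 1 GByte figure can be checked directly. The sketch below assumes that "128 million" is read as 128 x 2^20 elements; with a decimal million the result is 1,024,000,000 bytes, which is close to the same figure.

```python
ELEMENTS = 128 * 2**20   # assumption: "128 million" read as 128 Mi elements
BYTES_PER_ELEMENT = 8    # the "modest 8 bytes" per directory element

table_bytes = ELEMENTS * BYTES_PER_ELEMENT
table_gib = table_bytes / 2**30   # size of the whole table in GiB

decimal_bytes = 128_000_000 * BYTES_PER_ELEMENT  # decimal reading of "million"
```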

FIG. 4C is a diagram of a multi-level directory mechanism.
The directory table 110 is partitioned into segments
111a, 111b, 111c. In the preferred embodiments, there are
two or three segments 111a-111c, although a larger number
of segments may be used. The first segment 111a is the
smallest, and fits in main memory such as the main memory
1106 of the computer system shown in FIG. 11 and discussed
in detail below. The second and third segments 111b,
111c are progressively larger. The second and third segments
111b, 111c are coupled through a paging mechanism to a
mass storage device 1110 such as a disk. The second and
third segments 111b, 111c dynamically page data in from the
disk if requested data is not present in the main memory
1106.

As directory elements are accessed more often, the directory
elements are moved to successively higher segments
among the segments 111a-111c of the multi-level directory.
Thus, frequently accessed directory elements are more likely
to be stored in main memory 1106. The most popular
elements appear in the highest and smallest segment 111a of
the directory, and will all be present in main memory 1106.
Popularity of entries is tracked using a small counter that is
several bits in length. This counter is updated as described 
in the section SCALED COUNTER UPDATING. This 
multi-level directory approximates the performance of 
in-memory hash tables, while providing cost-effective 
aggregate storage capacity for terabyte-sized caches, by 
placing inactive elements on disk. 
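
One way to picture the promotion of popular elements is the following Python sketch. It is illustrative only. The class `MultiLevelDirectory`, the 3-bit counter width, and the rule of promoting an element after every second access are assumptions; segment capacity limits, demotion, paging, and the scaled counter updating described elsewhere are omitted.

```python
from typing import Dict, List

COUNTER_BITS = 3                       # assumption: "several bits"
COUNTER_MAX = (1 << COUNTER_BITS) - 1  # counters saturate at this value
PROMOTE_EVERY = 2                      # assumption: accesses needed per promotion


class MultiLevelDirectory:
    def __init__(self, levels: int = 3) -> None:
        # segments[0] plays the role of segment 111a (smallest, in memory);
        # higher indexes stand for the larger, disk-backed segments.
        self.segments: List[Dict[int, int]] = [{} for _ in range(levels)]

    def add(self, key: int) -> None:
        self.segments[-1][key] = 0  # new elements start in the lowest segment

    def level_of(self, key: int) -> int:
        for level, segment in enumerate(self.segments):
            if key in segment:
                return level
        raise KeyError(key)

    def access(self, key: int) -> None:
        level = self.level_of(key)
        count = min(self.segments[level][key] + 1, COUNTER_MAX)  # saturating
        self.segments[level][key] = count
        if level > 0 and count % PROMOTE_EVERY == 0:
            # Move the element one segment closer to main memory.
            del self.segments[level][key]
            self.segments[level - 1][key] = count


directory = MultiLevelDirectory()
directory.add(42)
directory.add(7)
for _ in range(4):
    directory.access(42)  # popular element
directory.access(7)       # rarely used element
```

After these accesses the popular element 42 sits in the highest segment, while element 7 remains in the lowest, disk-backed segment.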

Directory Paging 

As discussed, in a preferred embodiment, the Directory 
Table 110 is implemented as a multi-level hash table. 
Portions of the Directory Table may reside out of main 
memory, on disk. Data for the Directory Table is paged in 
and out of disk on demand. A preferred embodiment of this 
mechanism uses direct disk I/O to carefully control the 
timing of paging to and from disk and the amount of 
information that is paged. 
Another embodiment of this approach exploits a feature
of UNIX-type operating systems to map files directly into
virtual memory segments. In this approach, the cache maps
the Directory Table into virtual memory using the UNIX
mmap( ) facility. For example, a mmap request is provided
to the operating system, with a pointer to a file or disk
location as a parameter. The mmap request operates as a
request to map the referenced file or disk location to a
memory location. Thereafter, the operating system
automatically loads portions of the referenced file or disk location
from disk into memory as necessary.

Further, when the memory location is updated or
accessed, the memory version of the object is written back
to disk as necessary. In this way, native operating system
mechanisms are used to manage backup storage of the tables
in non-volatile devices. However, at any given time it is
typical that only a portion of the Directory Table 110 is
located in main memory.

In a typical embodiment, the Directory Table and Open
Directory are stored using a "striping" technique. Each set of
the tables is stored on a different physical disk drive. For
example, set 110a of Directory Table 110 is stored on storage
device 90a, set 110b is stored on storage device 90b, etc. In
this arrangement, the number of seek operations needed for
a disk drive head to arrive at a set is reduced, thereby
improving speed and efficiency of the cache.

It should be noted when paging data between disk and
memory certain safeguards are taken to ensure that the
information stored in memory is consistent with the
corresponding information stored in a non-volatile storage
device. The techniques used to provide efficient consistency
in object caches are summarized in the context of garbage
collection, in the section named SYNCHRONIZATION
AND CONSISTENCY ENFORCEMENT.

Vector of Alternates

As mentioned above, it is possible for a single URL to
map to an object that has numerous versions. These versions
are called "alternates". In systems that do not use an object
cache, versions are selected as follows. The client 10a
establishes an HTTP connection to the server 40 through the
Internet 20. The client provides information about itself in
an HTTP message that requests an object from the server.
For example, an HTTP request for an object contains header
information that identifies the Web browser used by the
client, the version of the browser, the language preferred by
the client, and the type of media content preferred by the
client. When the server 40 receives the HTTP request, it
extracts the header information, and selects a variant of the
object 52 based upon the values of the header information.
The selected alternate is returned to the client 10a in a
response message. This type of variant selection is promoted
by the emerging HTTP/1.1 hypertext transfer protocol.

It is important for a cache object store to efficiently
maintain copies of alternates for a URL. If a single object is
always served from cache in response to any URL requests,
a browser may receive content that is different than that
obtained directly from a server. For this reason, each name
key in the directory table 110 maps to one of the vectors of
alternates 122a-122n, which enable the cache 80 to select
one version of an object from among a plurality of related
versions. For example, the object 52 may be a Web page and
server 40 can store versions of the object in the English,
French, and Japanese languages.

Each Vector of Alternates 122a-122n is a structure that
stores a plurality of alternate records 123a-123n. Each of the
alternate records 123a-123n is a structure that stores
information that describes an alternative version of the requested
object 52. For example the information describes a particular
browser version, a human language in which the object
has been prepared, etc. The alternate records also each store
a full object key that identifies an object that contains the
alternative version. In the preferred embodiment, each of the
alternate records 123a-123n stores request information,
response information, and an object key 56.

Because a single popular object name may map to many
alternates, in one embodiment a cache composes explicit or
implicit request context with the object name to reduce the
number of elements in the vector. For example, the
User-Agent header of a Web client request (which indicates the
particular browser application) may be concatenated with a
web URL to form the name key. By including contextual
information directly in the key, the number of alternates in
each vector is reduced, at the cost of more entries in the
directory table. In practice, the particular headers and
implicit context concatenated with the information object
name are configurable.

These Vectors of Alternates 122a-122n support the correct
processing of HTTP/1.1 negotiated content. Request
and response information contained in the headers of
HTTP/1.1 messages is used to determine which of the alternate
records 123a-123n can be used to satisfy a particular
request. When cache 80 receives requests for objects, the
requests typically contain header information in addition to
the name (or URL) of the desired object. As explained
above, the name is used to locate the appropriate Vector of
Alternates. Once the appropriate Vector of Alternates is
found, the header information is used to select the
appropriate alternate record for the request.

Specifically, in the cache 80, the header information is
received and analyzed. The cache 80 seeks to match values
found in the header information with request information of
one of the alternate records 123a-123n. For example, when
the cache 80 is used in the context of the World Wide Web,
requests for objects are provided to a server containing the
cache in the form of HTTP requests.

The cache 80 examines information in an HTTP request
to determine which of the alternate records 123a-123n to
use. For example, the HTTP request might contain request
information indicating that the requesting client 10a is
running the Netscape Navigator browser program, version
3.0, and prefers German text. Using this information, the
cache 80 searches the alternate records 123a through 123n
for response information that matches the browser version
and the client's locale from the request information. If a
match is found, then the cache retrieves the object key from
the matching alternate and uses the object key to retrieve the
corresponding object from the cache.

The cache optimizes the object chosen by matching the
criteria specified in the client request. The client request may
specify minimal acceptance criteria (e.g. the document must
be a JPEG image, or the document must be Latin). The client
request may also specify comparative weighting criteria for
matches (e.g. will accept a GIF image with weight 0.5, but
prefer a JPEG image at weight 0.75). The numeric weightings
are accumulated across all constraint axes to create a
final weighting that is optimized.

The object key is used to retrieve the object in the manner
described above. Specifically, a subkey portion of the object
key is used to initiate another search of the Tag Table 102
and the Directory Table 110, seeking a hit for the subkey
value. If there is a hit in both the Tag and Directory Tables,
then the block in the Directory Table arrived at using the
subkey values will always reference a stored object (e.g.
stored object 124). Thus, using the Vector of Alternates 122,
the cache 80 can handle requests for objects having multiple
versions and deliver the correct version to the requesting
client 10a.
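
Selection of an alternate record by acceptance and weighting criteria might be sketched as follows. The names `AlternateRecord` and `select_alternate`, the attribute dictionary, and the use of a simple sum to accumulate weights across axes are assumptions made for illustration; the text does not specify how the weightings are combined.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AlternateRecord:
    attributes: Dict[str, str]  # e.g. {"type": "image/jpeg", "language": "de"}
    object_key: int             # full object key of the stored alternate


def select_alternate(
    vector: List[AlternateRecord],
    required: Dict[str, str],
    weights: Dict[str, Dict[str, float]],
) -> Optional[AlternateRecord]:
    """Pick the record that meets every required criterion and has the
    highest weight summed over the weighted axes (summing is an assumption)."""
    best, best_score = None, -1.0
    for record in vector:
        if any(record.attributes.get(k) != v for k, v in required.items()):
            continue  # fails a minimal acceptance criterion
        score = 0.0
        for axis, prefs in weights.items():
            score += prefs.get(record.attributes.get(axis, ""), 0.0)
        if score > best_score:
            best, best_score = record, score
    return best


vector = [
    AlternateRecord({"type": "image/gif", "language": "de"}, 0xA1),
    AlternateRecord({"type": "image/jpeg", "language": "de"}, 0xB2),
    AlternateRecord({"type": "image/jpeg", "language": "en"}, 0xC3),
]
chosen = select_alternate(
    vector,
    required={"language": "de"},
    weights={"type": {"image/gif": 0.5, "image/jpeg": 0.75}},
)
```

With these hypothetical records, the German JPEG alternate is chosen, and its object key would then be used for the second search of the Tag Table and Directory Table.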

In FIG. 4, only one exemplary Vector of Alternates 122 
and one exemplary stored object 124 are shown. However, 
in practice the cache 80 includes any number of vectors and 
disk blocks, depending on the number of objects that are
indexed and the number of alternative versions associated 
with the objects. 

Read Ahead 

FIG. 4B is a diagram showing a storage arrangement for
exemplary Vectors of Alternates 122a-122n. The system
attempts to aggregate data objects contiguously after the
metadata. Because seeks are time-consuming but sequential
reads are fast, performance is improved by consolidating
data with metadata, and pre-fetching data after the metadata.

In one of the storage devices 90a-90n, each of the Vectors
of Alternates 122a-122n is stored in a location that is
contiguous to the stored objects 124a-124b that are associated
with the alternate records 123a-123n represented in the
vector. For example, a Vector of Alternates 122a stores
alternate records 123a-123c. The alternate record 123a
stores request and response information indicating that a
stored object 124a associated with the alternate record is
prepared in the English language. Another alternate record
123b stores information indicating that its associated stored
object 124b is intended for use with the Microsoft Internet
Explorer browser. The stored objects 124a, 124b referenced
by the alternate records 123a, 123b are stored contiguously
with the Vectors of Alternates 122a-122n.

The Size value 120 within each alternate record indicates
the total size in bytes of one of the associated Vectors of
Alternates 122a-122n and the stored object 124. When the
cache 80 references a Vector of Alternates 122a based on the
disk location value 118, the cache reads the number of bytes
indicated by the Size value. For example, in the case of the
Vectors of Alternates shown in FIG. 4B, the Size value
would indicate the length of the Vector of Alternates 122a
plus the length of its associated stored object 124a.
Accordingly, by referencing the Size value, the cache 80
reads the vector as well as the stored object. In this way, the
cache 80 "reads ahead" of the Vector of Alternates 122 and
retrieves all of the objects 50 from the storage devices
90a-90n. As a result, both the Vector of Alternates and the
objects 50 are read from the storage device using a single
seek operation by the storage device. Consequently, when
there is a hit in the cache 80, in the majority of cases (where
there is a single alternate) the requested object 52 is retrieved
from a storage device using a single seek.

When the disk location value 118 directly references a
stored object 124, rather than a Vector of Alternates 122, the 
Size value 120 indicates the size of the object as stored in the 
disk block. This value is used to facilitate single-seek 
retrieval of objects, as explained further herein. 
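
The read-ahead idea can be illustrated with a small Python sketch in which an in-memory byte stream stands in for a storage device. The function `read_with_read_ahead`, the byte layout, and the offset 64 are hypothetical; the only point shown is that one seek followed by one sequential read of Size bytes returns the metadata together with the object stored after it.

```python
import io


def read_with_read_ahead(device: io.BufferedIOBase, disk_location: int, size: int) -> bytes:
    """One seek, then one sequential read covering metadata and the data after it."""
    device.seek(disk_location)
    return device.read(size)


# Hypothetical layout: a Vector of Alternates followed directly by its object.
vector_bytes = b"VECTOR:en->key1;"
object_bytes = b"<html>hello</html>"
disk = io.BytesIO(b"\x00" * 64 + vector_bytes + object_bytes + b"\x00" * 64)

# Plays the role of the Size value 120: vector length plus object length.
size_value = len(vector_bytes) + len(object_bytes)

blob = read_with_read_ahead(disk, 64, size_value)
vector_part = blob[:len(vector_bytes)]
object_part = blob[len(vector_bytes):]
```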

The Open Directory 

In one embodiment, the cache 80 further comprises an 
Open Directory 130. The Open Directory 130 stores a 
plurality of linked lists 132a-132n, which are themselves
composed of a plurality of list entries 131a-131n. Each of
the linked lists 132a-132n is associated with one of the sets
110a-110n in the Directory Table 110. The Open Directory
130 is stored in volatile main memory. Preferably, each list
entry 131a-131n of the Open Directory 130 stores an object
key that facilitates associative lookup of an information 
object. For example, each item within each linked list 
132a-132n stores a complete object key 56 for an object 52. 

The Open Directory accounts for objects that are currently 
undergoing transactions, to provide mutual exclusion 
against conflicting operations. For example, the Open Direc- 
tory is useful in safeguarding against overwriting or deleting 
an object that is currently being read. The Open Directory 
also buffers changes to the Directory Table 110 before they 
are given permanent effect in the Directory Table 110. At an 
appropriate point, as discussed below, a synchronization 
operation is executed to move the changes reflected in the 
Open Directory 130 to the Directory Table 110. This pre- 
vents corruption of the Directory Table 110 in the event of 
an unexpected system failure or crash. 
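
The two roles of the Open Directory described above, guarding objects that are in use and buffering changes until a synchronization point, can be sketched as follows. The class and method names are hypothetical, the Directory Table is reduced to a dictionary from object key to disk location, and the linked-list organization is not modeled.

```python
from typing import Dict


class OpenDirectory:
    def __init__(self, directory_table: Dict[int, int]) -> None:
        self.directory_table = directory_table  # object key -> disk location
        self.pending: Dict[int, int] = {}       # buffered, not yet permanent
        self.readers: Dict[int, int] = {}       # object key -> active read count

    def begin_read(self, key: int) -> None:
        self.readers[key] = self.readers.get(key, 0) + 1

    def end_read(self, key: int) -> None:
        self.readers[key] -= 1
        if self.readers[key] == 0:
            del self.readers[key]

    def stage_update(self, key: int, disk_location: int) -> bool:
        """Buffer a change unless the object is currently being read."""
        if key in self.readers:
            return False  # conflicting operation refused
        self.pending[key] = disk_location
        return True

    def synchronize(self) -> None:
        """Move the buffered changes into the Directory Table."""
        self.directory_table.update(self.pending)
        self.pending.clear()


table = {1: 100}
open_dir = OpenDirectory(table)
open_dir.begin_read(1)
refused = open_dir.stage_update(1, 200)   # object 1 is being read
open_dir.end_read(1)
accepted = open_dir.stage_update(1, 200)  # now allowed, but only buffered
before_sync = table[1]
open_dir.synchronize()
after_sync = table[1]
```

In this sketch the Directory Table entry changes only when `synchronize` runs, so a failure before that point leaves the table in its previous state.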

Further, in one embodiment, when an object is requested 
from the cache 80, the Open Directory 130 is consulted first; 
it is considered the most likely place to yield a hit, because 
it contains references to the most recently used information 
objects. The Open Directory in this form serves as a cache 
in main memory for popular data.

Disk Data Layout and Aggregation 

After the Open Directory 130, Tag Table 102 and Direc- 
tory Table 110 have been accessed to determine the location 
of a stored object 124, the object must be read from storage 
and transmitted to the user that requested the object. To 
improve the efficiency of read operations that are used to 
retrieve objects 50 from the cache 80, certain data aggrega- 
tion techniques are used when initially storing the data. 
When data is initially stored on disk according to the data 
aggregation techniques described herein, the efficiency of 
subsequent reads is improved greatly. 

FIG. 6 is a block diagram of a data storage arrangement 
for use with the cache 80 and the storage devices 90a-90n. 
A storage device 90a, such as a disk drive, stores data in 
a plurality of pools 200a-200n. A pool is a segment or chunk
of contiguous disk space, preferably up to 4 Gbytes in size. 
Pools can be allocated from pieces of files, or segments of 
raw disk partitions. 

Each pool, such as pool 200n, comprises a header 202 and
a plurality of fixed size storage spaces referred to herein as
"arenas" 204a through 204n. The size of the arenas is
preferably configurable or changeable to enable optimiza- 
tion of performance of the cache 80. In the preferred 
embodiment, each of the arenas 204a-204n is a block 
approximately 512 Kbytes to 2 Mbytes in size. 

Data to be written to arenas is staged, or temporarily
stored, in a "write aggregation buffer" in memory. This
buffer accumulates data, and when full, the buffer is written 
contiguously, in one seek, to an arena on disk. The write 
aggregation buffer improves the performance of writes, and 
permits sector alignment of data, so data items can be 
directly read from raw disk devices. 

The write aggregation buffer is large enough to hold the 
entire contents of an arena. Data is first staged and consoli- 
dated in the write aggregation buffer, before it is dropped 
into the (empty) arena on disk. The write aggregation buffer 
also contains a free top pointer that is used to allocate 
storage out of the aggregation buffer as it is filling, an 
identifier naming the arena it is covering, and a reference 
count for the number of active users of the arena. 
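
A write aggregation buffer of this kind might look like the following Python sketch. The class name, the 16-byte arena size, and the use of an in-memory byte stream as the disk are assumptions for illustration; sector alignment and reference-count management are not shown.

```python
import io
from typing import Optional


class WriteAggregationBuffer:
    def __init__(self, arena_id: int, arena_size: int) -> None:
        self.arena_id = arena_id            # identifier of the arena being covered
        self.data = bytearray(arena_size)   # large enough for a whole arena
        self.top = 0                        # free top pointer
        self.ref_count = 0                  # active users of the arena

    def allocate(self, payload: bytes) -> Optional[int]:
        """Copy payload at the free top pointer; return its offset, or None if full."""
        if self.top + len(payload) > len(self.data):
            return None
        offset = self.top
        self.data[offset:offset + len(payload)] = payload
        self.top += len(payload)
        return offset

    def flush(self, device: io.BufferedIOBase, arena_offset: int) -> None:
        """Write the staged contents contiguously: one seek, one write."""
        device.seek(arena_offset)
        device.write(bytes(self.data[:self.top]))


buffer = WriteAggregationBuffer(arena_id=0, arena_size=16)
first = buffer.allocate(b"abcdefgh")
second = buffer.allocate(b"ijkl")
too_big = buffer.allocate(b"0123456789")  # does not fit in the remaining space

disk = io.BytesIO(bytes(64))              # hypothetical disk image
buffer.flush(disk, arena_offset=32)
```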

Each pool header 202 stores a Magic number, a Version 
No. value, a No. of Arenas value, and one or more arena 
headers 206a-206n. The Magic number is used solely for
internal consistency checks. The Version No. value stores a
version number of the program or process that created the
arenas 206a-206n in the pool. It is used for consistency
checks to ensure that the currently executing version of the
cache 80 can properly read and write the arenas. The No. of
Arenas value stores a count of the number of arenas that are
contained within the pool.

For each of the arenas in the pool, the pool header 202
stores information in one of the arena headers 206a-206n.
Each arena header stores two one-bit values that indicate
whether the corresponding arena is empty and whether the
arena has become corrupted (e.g. due to physical disk
surface damage, or application error).

As shown in FIG. 6 in the exemplary case of an arena
204a, each arena comprises one or more data fragments
208a-208n. Each fragment 208a-208n comprises a fragment
header 208d and fragment data 208e. The fragment
data 208e is the actual data for an object that is stored in the
cache 80. The data for an entire stored object may reside
within a single fragment, or may be stored within multiple
fragments that may reside in multiple arenas. The fragment
header 208d stores a Magic number value 206c, a key value
206a and a length value 206b.

The length value 206b represents the length in bytes of the
fragment, including both the fragment header 208d and the
fragment data 208e. The key value 206a is a copy of the
object key, stored in its entirety, of the object whose data is
in the fragment. Thus, the key value 206c can be used to look
up the directory block that points to the first fragment that
holds data of the object whose data is contained in the
fragment.

According to one embodiment, the complete object key
56 is stored in association with the last fragment associated
with a particular object. When an object 52 is stored in the
cache 80 for the first time, the object key 56 is computed
incrementally as object data is read from the originating
server 40. Thus, the final value of the object key 56 cannot
be known until the entire object 52 is read. The object key
56 is written at the end of the chain of fragments used to
store the object, because the value of the key is not known
until the last fragment is written, and because modifying
existing data on disk is slow. In alternate embodiments, the
fragment header can store other metadata that describes the
fragment or object.

The write aggregation buffer contains a "free top pointer"
210 indicating the topmost free area of the buffer 204a. The
top pointer 210 identifies the current boundary between used
and available space within the buffer 204a. The top pointer

main memory of the workstation that runs the cache 80
stores an array of pointers to empty arenas. In alternate
embodiments, additional information can be stored in the
header 206a-n of each arena. For example, the header may
store values indicating the number of deleted information
objects contained in the arena, and a timestamp indicating
when garbage collection was carried out last on the arena.

Although three fragments are shown in FIG. 6 as an
example, in practice any number of fragments may be stored
in an arena until the capacity of the arena is reached. In
addition, the number of pools and the number of arenas
shown in FIG. 6 are merely exemplary, and any number may
be used.

The above-described structure of the arenas facilitates
certain consistent and secure mechanisms of updating data
for objects that are stored in fragments of the arenas. FIG. 7
is a block diagram relating to updating one of the arenas
204a-204n of FIG. 6. FIG. 7 shows an arena 204a containing
a first information object 208b having a header 206 and
data fragments 208a-208c. Top pointer 210 points to the
topmost active portion of the arena 204a, which is the end
of the data segment 208c. Preferably, the Directory Table is
updated only after a complete information object has been
written to an arena, including header and data, and only after
the top pointer of the arena has been moved successfully. For
example, a complete information object is written to the
arena 204a above the top pointer 210, and the top pointer is
moved to indicate the new top free location of the arena.
Only then is the Directory Table updated.

The delayed updating of the Directory Table is carried out
to ensure that the Directory Table remains accurate even if
a catastrophic system failure occurs during one of the other
steps. For example, if a disk drive or other element of the
system crashes before completion of one of the steps, no
adverse effect occurs. In such a case, the arena 204a may
contain corrupt or incomplete data, but the cache 80 will
effectively ignore such data because nothing in the Directory
Table 110, indexes or hash tables is referencing the corrupt
data. In addition, using the Garbage Collection process
described herein, the corrupt or incomplete data is eventually
reclaimed.

Multi-Fragment Objects

In FIG. 3, the directory table block 112b that is arrived at
based on the object key of object 52 includes a pointer
directly to the fragment in which the object 52 is stored. This
assumes that object 52 has been stored in a single fragment.

However, large objects may not always fit into a single
fragment, for two reasons. First, fragments have a fixed

210 is stored to enable the cache 80 to determine where to maximum size (preferred value is 32 KB). Objects greater 

write additional fragments in the buffer. Everything below than 32 KB will be fragmented. Second, the system must 

(or, in FIG. 6, to the left of) the top pointer 210 contains or pre-reserve space in the write aggregation buffer for new 

has already been allocated to receive valid data. The area of objects. If the object store does not know the size of the 

the arena 204a above the top pointer 210 (to the right in FIG. 55 incoming object, it may guess wrong. The server may also 

6) is available for allocation for other information objects. misrepresent the true (larger) size of the object. In both 

Preferably, each fragment includes a maximum of 32 kilo- cases, the object store would create a chain of fragments to 

bytes of data. Fragments start and end on standard 512-byte handle the overflow. 

boundaries of the storage device 90a. In the context of the Therefore, a mechanism is provided for tracking which 

World Wide Web, most objects are relatively small, gener- 6 o fragments contain data from objects that are split between 

ally less than 32K in size. fragments. FIG. 5 is a block diagram of a preferred structure 

Each arena may have one of two states at a given time: the for keeping track of related fragments, 

empty state or the occupied state. The current state of an For the purpose of explanation, it shall be assumed that an 

arena is reflected by the Empty value stored in each arena object X is stored in three fragments 208a, 2086 and 208c 

header 206a-206n. In the occupied state, some portion of 65 on storage devices 90a-90w. Using the object key for object 

the arena is storing usable data. A list of all arenas that are X, the cache traverses the Tag Table to arrive at a particular 

currently empty or free is stored in memory. For example, block 141a within the Directory Table 110. Block 141a is the 
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head of a chain of blocks that identify successive fragments 
that contain the object X. In the illustrated example, the 
chain is includes blocks 141a, 141b, 141c, 141d and 141e, 
in that order, and is formed by pointers 128a through 128d. 

According to one embodiment, the head block 141a 
comprises a sub key value 126 and a block pointer 128a. 
Preferably, the subkey value 126 is 96- bits in length and 
comprises a subset of the value of the object key 56 for 
object X. The value of the block pointer 128a references the 
next block 141/? in the chain. 

Directory table block 141b comprises a fragment pointer
130a and a block pointer 128b. The fragment pointer 130a
references a fragment 208a that stores the first portion of the
data for the object X. The block pointer 128b of pointer
block 141b references the next pointer block 141c in the
chain. Like pointer block 141b, pointer block 141c has a
fragment pointer 130b that references a fragment 208b. The
block pointer 128c of pointer block 141c references the next
pointer block 141d in the chain. Like pointer block 141c,
pointer block 141d has a fragment pointer 130b that
references a fragment 208c.

The object store needs a mechanism to chain fragments
together. Traditional disk block chaining schemes require
modifying pre-existing data on disk, to change the previous
chain-link pointers to point to the new next-block values.
Modification of pre-existing disk data is time-consuming
and creates complexities relating to consistency in the face
of unplanned process termination.

According to one embodiment of the invention, the need 
to patch new fragment pointers into extant fragments is 
removed by using "iterative functional pointers". Each frag- 
ment is assigned a key, and the key of the next fragment is 
assigned as a simple iterative function of the previous 
fragment's key. In this manner, fragments can be chained 
simply by defining the key of the next fragment, rather than 
by modifying the pointer of the previous fragment. 

For example, the block pointer 128a is computed by
applying a function to the value of subkey 126. The block
pointer value 128b is computed by applying a function to the
value of the block pointer 128a. The function used to
compute the pointer values is not critical, and many different
functions can be used. The function can be a simple
accumulating function such that

key_n = key_{n-1} + 1

or the function can be a complex function such as the MD5
hash function

key_n = MD5(key_{n-1})

The only requirement is that the range of possible key values
should be sufficiently large, and the iteration function should
be suitably chosen, so that the chances of range collision or
cyclic looping are small. In the very unlikely event of key
collision, the object will be deleted from the cache.
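
The iterative functional pointers described above can be
illustrated with a minimal Python sketch. The function names,
the 16-byte key width, and the use of byte strings as keys are
assumptions made for illustration only; the text requires only
that each key be a function of the previous one.

```python
import hashlib


def next_key_increment(key: bytes) -> bytes:
    """key_n = key_{n-1} + 1, wrapping around at the key width."""
    width = len(key)
    value = (int.from_bytes(key, "big") + 1) % (1 << (8 * width))
    return value.to_bytes(width, "big")


def next_key_md5(key: bytes) -> bytes:
    """key_n = MD5(key_{n-1}); the digest is always 16 bytes."""
    return hashlib.md5(key).digest()


def fragment_keys(first_key: bytes, count: int, step=next_key_md5):
    """Derive the keys of `count` chained fragments from the first key.

    No stored fragment is modified: the key of fragment n is
    recomputed from the key of fragment n-1 whenever it is needed.
    """
    keys = [first_key]
    for _ in range(count - 1):
        keys.append(step(keys[-1]))
    return keys
```

Because the successor key is recomputed rather than stored,
appending a fragment never requires rewriting an earlier one.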

The last pointer block 141d in the chain has a block
pointer 128d that points to a tail block 141e. The tail block
141e comprises a reference to the first block 141a in the
chain. According to one embodiment, the reference contained
in the tail block 141e is a 96-bit subkey 132 of the object
key of object X. The cache can use the 96-bit subkey 132 to
locate the head block 141a of the chain. The tail block 141e,
and the looped pointer arrangement it provides, enables the
cache 80 to locate all blocks in a chain, starting from any
block in the chain.

Three fragments 208a, 208b, and 208c are shown in FIG.
5 merely by way of example. In practice, an information
object may occupy or reference any number of fragments,
each of which would be identified by its own pointer block
within the Directory Table 110.

When the object 52 is read from the storage device, the
last fragment is read first to ensure that the content MD5 key
stored there matches the directory key value. This test is
done as a "sanity check" to ensure that the correct object has
been located. If there is no match, a collision has occurred
and an exception is raised.


Space Allocation 

FIG. 10A is a flow diagram of a method of allocating
space for objects newly entered into the cache and for
writing such objects into the allocated space. The allocation
and write method is generally indicated by reference
numeral 640. Generally the steps shown in FIG. 10A are
carried out when a miss has occurred in the Directory Table
and Tag Table, for example, at step 898 of FIG. 8F.

Accordingly, in step 642, an information object that has
been requested by a client, but not found in the cache, is
looked up and retrieved from its original location. In a
networked environment, the origin is a server 40, a cluster,
or a disk. When the object is retrieved, in step 644 the
method tests whether the object is of the type and size that
can be stored in the cache, that is, whether it is "cacheable."

Examples of non-cacheable objects include Web pages
that are dynamically generated by a server application, panes
or portions of Web pages that are generated by client side
applets, objects that are constructed based upon dynamic
data taken from a database, and other non-static objects.
Such objects cannot be stored in the cache because their
form and contents change each time that they are generated.
If such objects were to be stored in the cache, they would be
unreliable or incorrect in the event that underlying dynamic
data were to change between cache accesses. The process
determines whether the object is cacheable by examining
information in the HTTP response from the server 40 or
other source of the object.

If the object is cacheable, then in step 646 the method
obtains the length of the object in bytes. For example, when
the invention is applied to the World Wide Web context, the
length of a Web page can be included in metadata that is
carried in an HTTP transaction. In such a case, the cache
extracts the length of the information object from the
response information in the HTTP message that contains the
information object. If the length is not present, an estimate
is generated. Estimates may be incorrect, and will lead to
fragmented objects.

As shown in block 648, space is allocated in a memory-
resident write aggregation buffer, and the object to be written
is streamed into the allocated buffer location. In a preferred
embodiment, block 648 involves allocating space in a write
aggregation buffer that has sufficient space and is available
to hold the object. In block 650, the cache tests whether the
write aggregation buffer has remaining free space. If so, the
allocation and write process is complete and the cache 80
can carry out other tasks. When the write aggregation buffer
becomes full, then the test of block 650 is affirmative, and
control is transferred to block 656.

In block 656, the cache writes the aggregation buffer to
the arena it is shadowing. In step 660, the Directory is
updated to reflect the location of the new information object.
The foregoing sequence of steps is ordered in a way that
ensures the integrity of information objects that are written
to the cache. For example, the Directory is updated only
after a complete information object has been written to an
arena, including header and data. For example, if a disk
drive or other element of the system crashes before
completion of step 652 or step 658, no adverse effect occurs.
In such a case, the arena will contain corrupt or incomplete
data, but the cache will effectively ignore such data because
nothing in the indexes or hash tables is referencing the
corrupt data. In addition, using the garbage collection process
described herein, the corrupt or incomplete data is eventually
reclaimed.
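
The ordering described above (object data reaches the arena
first, and the directory is updated only afterwards) can be
sketched in Python as a toy model. The class name, the
in-memory bytearray standing in for the arena on disk, and the
policy of flushing when the next object would not fit are all
assumptions made for illustration; the sketch also assumes that
each object fits in the buffer.

```python
class ArenaWriter:
    """Toy write aggregation buffer that shadows a single arena."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = bytearray()   # memory-resident aggregation buffer
        self.pending = {}           # key -> (offset, length), not yet on disk
        self.disk = bytearray()     # stands in for the arena on disk
        self.directory = {}         # key -> (offset, length), safe to serve

    def write_object(self, key: str, data: bytes) -> None:
        # Flush first if the new object would not fit (assumed policy).
        if len(self.buffer) + len(data) > self.capacity:
            self.flush()
        offset = len(self.disk) + len(self.buffer)
        self.buffer += data
        self.pending[key] = (offset, len(data))

    def flush(self) -> None:
        # 1. The complete object data is written to the arena.
        self.disk += self.buffer
        self.buffer.clear()
        # 2. Only then does the directory learn about the objects.
        self.directory.update(self.pending)
        self.pending.clear()

    def lookup(self, key: str):
        location = self.directory.get(key)
        if location is None:
            return None             # unknown, or not yet committed
        offset, length = location
        return bytes(self.disk[offset:offset + length])
```

If the process stopped between the two steps of `flush`, the
arena would hold data that no directory entry references, which
is the harmless failure mode the text describes.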


Garbage Collection 

FIG. 8A is a flow diagram of a method of garbage
collection that can be used with the cache 80. FIG. 8B is a
flow diagram of further steps in the method of FIG. 8A, and
will be discussed in conjunction with FIG. 8A. Preferably,
the garbage collection method is implemented as an
independent process that runs in parallel with other processes
that relate to the cache. This enables the garbage collection
method to periodically clean up cache storage areas without
interrupting or affecting the operation of the cache.

1. General Process 

In the preferred embodiment, "garbage collection"
generally means a process of scanning target arenas, identifying
active fragments or determining whether to delete
fragments, writing the active fragments contiguously to new
arenas, and updating the Directory Table to reference the
new locations of the fragments. Thus, in a very broad sense
the method is of the "evacuation" type, in which old or
unnecessary fragments are deleted and active fragments are
written elsewhere, so that at the conclusion of garbage
collection operations on a particular arena, the arena is
empty. Preferably, both the target arenas and the new arenas
are stored and manipulated in volatile memory. When
garbage collection is complete, the changes carried out in
garbage collection are written to corresponding arenas
stored in non-volatile storage such as disk, in a process
called synchronization.

In step 802, one of the pools 200a-200n is selected for
garbage collection operations. Preferably, for each pool
200a-200n of a storage device 90a, the cache stores or can
access a value indicating the amount of disk space in a pool
that is currently storing active data. The cache also stores
constant "low water mark" and "high water mark" values, as
indicated by block 803. When the amount of active storage
in a particular pool becomes greater than the "high water
mark" value, garbage collection is initiated and carried out
repeatedly until the amount of active storage in the pool falls
below the "low water mark" value. The "low water mark"
value is selected to be greater than zero, and the "high water
mark" value is chosen to be approximately 20% less than the
total storage capacity of the pool. In this way, garbage
collection is carried out at a time before the pool overflows
or the capacity of the storage device 90a is exceeded.
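
The water-mark trigger described above can be sketched as a
small Python loop. The function name, its parameters, and the
callback that collects one arena and reports the bytes it freed
are hypothetical; only the two thresholds and the
"repeat until below the low mark" behavior come from the text.

```python
def run_garbage_collection(active_bytes, high_water, low_water,
                           collect_one_arena):
    """Return the amount of active storage after collection.

    Collection starts only when active storage exceeds the high
    water mark, and repeats until it falls below the low water mark.
    `collect_one_arena` is a hypothetical callback that collects one
    arena and returns the number of bytes it freed.
    """
    if active_bytes <= high_water:
        return active_bytes          # nothing to do yet
    while active_bytes >= low_water:
        freed = collect_one_arena()
        if freed <= 0:
            break                    # no further progress is possible
        active_bytes -= freed
    return active_bytes
```

For a pool of 1000 units, a caller following the text might pass
a high water mark of about 800 (20% below capacity) and any low
water mark greater than zero.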

2. Usage-aware Garbage Collection

In step 804, one of the arenas is selected as a target for
carrying out garbage collection. The arena is selected by a
selection algorithm that considers various factors. As
indicated by block 805, the factors include, for example,
whether the arena is the last arena accessed by the cache 80,
and the total number of accesses to the arena. In alternate
embodiments, the factors may also include the number of
information objects that have been deleted from each arena,
how recently an arena has been used, how recently garbage
collection was previously carried out on each arena, and
whether an arena currently has read or write locks set on it.
Once the arena is selected for garbage collection, all of the
fragments inside the arena are separately considered for
garbage collection.

In step 806, one of the fragments within the selected arena
is selected for garbage collection. In determining which
fragment or fragments to select, the cache 80 takes into
account several selection factors, as indicated by block 807.
In the preferred embodiment, the factors include: the time of
the last access to the fragment; the number of hits that have
occurred to an object that has data in the fragment; the time
required to download data from the fragment to a client; and
the size of the object of which the fragment is a part. Other
factors are considered in alternate embodiments. Values for
these factors are stored in a block 112a-112n that is
associated with the object for which the fragment stores data.

In block 808, the cache determines whether a fragment 
should be deleted. In the preferred embodiment, block 808 
involves evaluation of certain performance factors and opti- 
mization considerations. 

Caches are used for two primary, and potentially 
conflicting, reasons. The first reason is improving client 
performance. To improve client performance, it is desirable 
for a garbage collector to retain objects that minimize server 
download time. This tends to bias a garbage collector toward 
caching documents that have been received from slow 
external servers. The second reason is minimizing server 
network traffic. To minimize server traffic, it is desirable for 
a garbage collector to retain objects that are large. Often, 
these optimizations conflict. 

By storing values that identify the time required to
download an object, the size of the object, and the number
of times the object was hit in cache, the garbage collector
can estimate, for each object, how much server download
time was avoided and how much server traffic was avoided,
by serving the cached copy as opposed to fetching from the
original server. This metric measures the inherent "value" of
the cached object.

The cache administrator then configures a parameter 
between 0 and 1, indicating the degree to which the cache 
should optimize for time savings or for traffic savings. The 
foregoing values are evaluated with respect to other objects 
in the arena, with respect to the amount of space the object 
is consuming, and with respect to objects recently subjected 
to garbage collection. Based on such evaluation, the cache 
80 determines whether to delete the fragment, as shown in 
step 808. 
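
One way to picture the value estimate and the administrator's
parameter is the Python sketch below. The text does not give a
formula, so everything here is an assumption: the names, the
choice that a parameter of 1 means "optimize purely for time
savings", and the normalization of each saving against the
largest saving seen among the objects being compared.

```python
from dataclasses import dataclass


@dataclass
class ObjectStats:
    download_seconds: float   # time required to download the object
    size_bytes: int           # size of the object
    hits: int                 # number of times it was hit in cache


def cached_value(stats, time_weight, max_time_saved, max_bytes_saved):
    """Blend time savings and traffic savings into one score in [0, 1].

    time_weight is the administrator's parameter between 0 and 1;
    here 1 favors time savings and 0 favors traffic savings (assumed).
    """
    time_saved = stats.hits * stats.download_seconds
    bytes_saved = stats.hits * stats.size_bytes
    t = time_saved / max_time_saved if max_time_saved else 0.0
    b = bytes_saved / max_bytes_saved if max_bytes_saved else 0.0
    return time_weight * t + (1.0 - time_weight) * b
```

With such a score, a small object from a slow server outranks a
large object from a fast server when the parameter is near 1,
and the ranking reverses when the parameter is near 0.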

If the fragment is to be deleted, then in step 812 it is 
deleted from the arena by marking it as deleted and over- 
writing the data in the fragment. When an object 52 is stored 
in multiple fragments, and the garbage collection process 
determines that one of the fragments is to be deleted, then 
the process deletes all fragments associated with the object. 
This may involve following a chain of fragments, of the type 
shown in FIG. 5, to another arena or even another pool. 

If the fragment is not to be deleted, then in step 810 the
fragment is written to a new arena. FIG. 8B, which is
discussed below, shows preferred sub-steps involved in
carrying out step 810.

After the fragment is deleted or moved to another arena,
in step 814 the Directory Table 110 is updated to reflect the
new location of the fragment. Step 814 involves using the
value of the key 206a in the fragment header 208d associated
with a fragment 208n to be updated to look up a block
112a-112n that is associated with the fragment. When the
correct Directory Table block 112a-112n is identified, the
disk location value 118 in the block is updated to reflect the
new location of the fragment. If the fragment has been
deleted, then any corresponding Directory Table entries are
deleted.

Step 816 indicates that the method is complete after the
Directory Table 110 is updated. However, it should be
understood that the steps of FIG. 8A are carried out for all
pools, all arenas within each pool, and all fragments within
each arena.

3. Writing Fragments to New Arenas

FIG. 8B is a flow diagram of steps involved in carrying
out step 810, namely, writing a fragment that is to be
preserved to a new arena. The process of writing evacuated
fragments to new arenas is completely analogous to writing
original fragments. The data is written into a write
aggregation buffer, and dropped to disk arenas when full.

In step 590, the directory tables are updated to reflect the
change in location of the fragment. In the preferred
embodiment, step 590 involves writing update information
in the Open Directory 130 rather than directly into the
Directory Table 110. At a later time, when the process can
verify that the fragment data 208e has been successfully
written to one of the storage devices 90a-90n, then the
changes reflected in the Open Directory 130 are written into
or synchronized with the Directory Table 110.

This process is used to ensure that the integrity of the
Directory Table 110 is always preserved. As noted above,
buffered storage is used for the fragments; thus, when a
fragment is updated or a new fragment is written, the
fragment data is written to a buffer and then committed to a
disk or other storage device at a future time. Thus, during
garbage collection, it is possible that a fragment that has
been moved to a new arena is not actually written on one of
the storage devices when the garbage collection process is
ready to update the Directory Table. Therefore, information
about the change is stored in the Open Directory 130 until
the change is committed to disk.

In step 592, the original arena is examined to test whether
it has other fragments that might need to be reclaimed or
moved to a new arena. If other objects are present, then
control returns to step 806 of FIG. 8A, so that the next object
can be processed. If no other objects are present in the
current arena, then in step 594, the top pointer of the current
arena is reset.

4. Buffering

In the preferred embodiment, read and write operations
carried out by the cache 80 and the garbage collection
process are buffered in two ways.

First, communications between the cache 80 and a client
10a that is requesting an object from the browser are
buffered through a flow-controlling, streaming, buffering
data structure called a VConnection. In the preferred
embodiment, the cache 80 is implemented in a set of
computer programs prepared in an object-oriented
programming language. In this embodiment, the VConnection
is an object declared by one of the programs, and the
VConnection encapsulates a buffer in memory. Preferably, the
buffer is a FIFO buffer that is 32 Kbytes in size.

When a client 10a-10c connects to the cache 80, the
cache assigns the client to a VConnection. Data received
from the client 10a is passed to the cache 80 through the
VConnection, and when the cache needs to send information
to the client 10a, the cache writes the information to the
VConnection. The VConnection regulates the flow of data
from the cache 80 to match the data transmission speed used
by the client 10a to communicate with the cache. In this way,
use of the VConnection avoids an unnecessary waste of
main memory storage. Such waste would arise if an object
being sent to the client 10a was copied to memory in its
entirety, and then sent to the client; during transmission to a
slow client, main memory would be tied up unnecessarily.
Buffered I/O using these mechanisms tends to reduce the
number of sequential read and write operations that are
carried out on a disk.

5. Synchronization and Consistency Enforcement

Regularly during the garbage collection process and during
operation of the cache 80, a synchronization process is
carried out. The synchronization process commits changes
reflected in the Open Directory 130 to the Directory Table
110 and to stable storage, such as non-volatile storage in one
or more of the storage devices 90a-90n. The goal is to
maintain the consistency of the data on disk at all times. That
is, at any given instant the state of the data structures on disk
is 100% consistent and the cache can start up without
requiring checking. This is accomplished through careful
ordering of the writing and synchronization of data and
meta-data to the disk.

For the purposes of discussion, in this section, 'data'
refers to the actual objects the cache is being asked to store.
For instance, if the cache is storing an HTML document, the
data is the document itself. 'Meta-data' refers to the
additional information the cache needs to store in order to index
the 'data' so that it can be found during a subsequent lookup()
operation as well as the information it needs to allocate
space for the 'data'. The 'meta-data' comprises the directory
and the pool headers. The directory is the index the
cache uses for associating a key (a name) with a particular
location on disk (the data). The cache uses the pool headers
to keep track of what disk space has been allocated within
the cache.

The cache uses two rules to maintain the consistency of
the data structures on disk. The first rule is that meta-data is
always written down after the data it points to. The rationale
for the first rule is that the cache has no "permanent"
knowledge of an object being in the cache until the meta-data
is written. If the cache were to write down the meta-data
before the data and then crash, the meta-data would associate
an object name with invalid object data on disk. This
is undesirable, since the cache would then have to use
heuristics to try and determine which meta-data points to
good data and which points to bad.

The second rule is that a pool arena cannot be marked as
empty in the pool header until all the directory meta-data
that points to the arena has been deleted and written to disk.
This is necessary so that a crash cannot cause an empty arena
to exist for which directory meta-data points to it. The
problem this can cause is that the empty arena can become
filled with new data, since it is empty and therefore it is
available for new data to be written into it. However, "old"
directory meta-data points to the same location as the new
data. It is possible for accesses to the old directory meta-data
to return the new data instead of either returning the old data
or failing.

FIG. 8C is a flow diagram of a preferred synchronization
method 820 that implements the foregoing two rules. In
block 822, an object is written to the cache. Block 822
involves the steps of block 824 and block 826, namely,
creating metadata in the Open Directory, and writing and
syncing the object data to disk.

The steps of blocks 828 through 820' are carried out
periodically. As indicated in block 828, for each piece of
meta-data in the open directory table, a determination is
made whether the data that the metadata points to is already
synchronized to disk, as shown in block 821. If so, then in
block 823, the cache copies the metadata that points to the
stable data from the Open Directory to the Directory Table.
In block 825, the changes are synchronized to disk.

In block 827, garbage collection is carried out on an arena.
Block 827 may involve the steps shown in FIG. 8A.
Alternatively, garbage collection generally involves the
steps shown in block 829, block 831, and block 820'. As
shown in block 829, for each fragment in the arena, the
cache deletes the directory metadata that points to the
segment, and writes the directory metadata to disk. In block
831, the pool header is modified in memory such that the
arena is marked as empty. In block 820', the pool header is
written and synced to disk.

The steps that involve writing information to disk
preferably use a "flush" operation provided in the operating
system of the workstation that is running the cache 80. The
"flush" operation writes any data in the buffers that are used
to store object data to a non-volatile storage device 90a-90c.

Using the foregoing methods, the Directory Table is not
updated with the changes in the Open Directory until the
data that the changes describe is actually written to disk or
other non-volatile storage. Also, the cache 80 postpones
updating the arenas on disk until the changes undertaken by
the garbage collection process are committed to disk. This
ensures that the arenas continue to store valid data in the
event that a system crash occurs before the Directory Table
is updated from the Open Directory.

6. Re-validation

In the preferred embodiment, the cache provides a way to
re-validate old information objects in the cache so that they
are not destroyed in the garbage collection process.

FIG. 12 is a flow diagram of a preferred re-validation
process. In block 1202, an external program or process
delivers a request to the cache that asks whether a particular
information object has been loaded by a client recently. In
response to the request, as shown in block 1204, the cache
locates the information object in the cache. In block 1206,
the cache reads a Read Counter value associated in the
directory tables with the information object. In block 1208,
the cache tests whether the Read Counter value is high.

If the Read Counter value is high, then the information
object has been loaded recently. In that case, in block 1210
the cache sends a positive response message to the
requesting process. Otherwise, as indicated in block 1212, the
information object has not been loaded recently.
Accordingly, as shown in block 1214, the cache sends a
negative responsive message to the calling program or
process. In block 1216, the cache updates an expiration date
value stored in association with the information object to
reflect the current date or time. By updating the expiration
date, the cache ensures that the garbage collection process
will not delete the object, because after the update it is not
considered old. In this way, an old object is refreshed in the
cache without retrieving the object from its origin, writing it
in the cache, and deleting a stale copy of the object.

Scaled Counter Updating

FIG. 10B is a flow diagram of a method of scaled counter
updating. In the preferred embodiment, the method of FIG.
10B is used to manage the Read Counter values that are
stored in each block 112a-112n of a set of the Directory
Table, as shown in FIG. 3A. However, the method of FIG.
10B is not limited to that context. The method of FIG. 10B
is applicable to any application that involves management of
each of a plurality of objects that has a counter, and in which
it is desirable to track the most recently used or least recently
used objects. A key advantage of the method of FIG. 10B in
comparison to past approaches is that it enables large
counter values to be tracked in a small storage area.

In the preferred embodiment, each of the Read Counter
values stored in blocks 112a-112n is stored in three bit
quantities. During operation of the cache 80, when a block
is accessed, the Read Counter value of the block is
incremented by one. The highest decimal number that can be
represented by a three-bit quantity is 7. Accordingly, a Read
Counter could overflow after being incremented seven
times. To prevent counter overflow, while enabling the
counters to track an unlimited number of operations that
increment them, the method of FIG. 10B is periodically
executed.

The following discussion of the steps of FIG. 10B will be
more clearly understood with reference to Table 1:

TABLE 1

In Table 1, the EVENT column identifies successive
events affecting a set of counter values, and briefly indicates
the nature of the event. The COUNTERS heading indicates
three counter values A, B, and C represented in separate
columns. Each of the counter values A, B, C corresponds to
a counter value that is stored in a different block 112a-112n
of the Directory Index 110. Thus, each row of Table 1
indicates the contents of three counter values at successive
snapshots in time.

Event 1 of Table 1 represents an arbitrary starting point in
time, in which the hash table entries containing the counter
values A, B, C each have been accessed once. Accordingly,
the value of each counter A, B, C is one. At event 2, the
cache has accessed the hash table entry that stores counter
value A. Accordingly, counter A has been incremented and
its value is 2; the other counters B, C are unchanged. Assume
that several other hash table entry accesses then occur, each
of which causes one of counters A, B, or C to be
incremented. Thereafter, at event 3, the values of the counters
A, B, C are 7, 3, and 1, respectively. Thus, counter A is storing
the maximum value it can represent, binary 111 or decimal
7, and will overflow if an attempt is made to increment it to
a value greater than 7.

At this point, the method of FIG. 10B is applied to the
counters A, B, C. In step 622, the values of the counters
are read. In step 624, the sum of all the counter values is taken.
In the case of Table 1, the sum is given by 7+3+1=11. In step
626, the maximum value that can be represented by all the
counters is computed based upon the length in bits of the
counter values. In the case of a three-bit value, the maximum
value of one counter is 7 and the maximum value of the sum
of three three-bit counters is 7x3=21. Alternatively, step 626
can be omitted, and the maximum value can be stored as a
constant that is available to the scaled counter method 620
and simply retrieved when needed.

In step 628, the method computes the value (maximum_
value/2), truncating any remainder or decimal portion, and
compares it to the sum of all the counters. In the example
above, the relationship is

Sum = 11

Maximum_Value = 21

Maximum_Value/2 = 10

(Sum > Maximum_Value/2) = TRUE

Since the result is true, control is transferred to step 630, in
which all the counter values are decremented by 1. The state
of counters A, B, C after this step is shown by Event 4,
"Decrement." Note that counter C, which represents the
least recently used hash table entry, has been decremented to
zero. At this point, least recently used hash table entries can
be reclaimed or eliminated by scanning the corresponding
counter values and searching for zero values. The result of
this step is indicated in Event 5 of Table 1, "Reclaim." The
values of counters A and B are unchanged, and the value of
counter C is undefined because its corresponding hash table
entry has been deleted from the hash table.

When the method of FIG. 10B is repeated periodically
and regularly, none of the plurality of counter values will
overflow. Also, least recently used entries are rapidly
identified by a counter value of zero, and can be easily eliminated
from the cache. Counter values can be maintained in few bits
even when hash table entries are accessed millions of times.
Thus, the method of FIG. 10B provides a fast, efficient way
to eliminate least recently used entries from a list.
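
The scaled counter pass of steps 622 through 630 can be
illustrated by the following minimal Python sketch. It is not
the patented implementation: the function name
scale_counters, the use of a dictionary keyed by entry name,
and the choice to reclaim zero-valued entries in the same
pass are assumptions made for illustration only.

```python
from typing import Dict, List

COUNTER_BITS = 3  # three-bit counters, as in the Table 1 example


def scale_counters(counters: Dict[str, int], bits: int = COUNTER_BITS) -> List[str]:
    """Run one scaled-counter pass and return the keys that were reclaimed.

    Hypothetical helper: `counters` maps an entry name to its counter value.
    """
    total = sum(counters.values())                       # step 624: sum of counters
    maximum_value = ((1 << bits) - 1) * len(counters)    # step 626: max representable sum
    if total > maximum_value // 2:                       # step 628: truncating division
        for key in counters:                             # step 630: decrement every counter
            if counters[key] > 0:
                counters[key] -= 1
    # Assumption: entries whose counter reached zero are reclaimed immediately.
    reclaimed = [key for key, value in counters.items() if value == 0]
    for key in reclaimed:
        del counters[key]
    return reclaimed


counters = {"A": 7, "B": 3, "C": 1}   # state at event 3 of the example
reclaimed = scale_counters(counters)  # sum 11 exceeds 21 // 2 == 10
```

With the event 3 values, the pass decrements all three
counters and entry C, now at zero, is reclaimed, leaving A
and B at 6 and 2.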

Cache Operations 

In the preferred embodiment, the cache 80 is implemented
in one or more computer programs that are accessible to
external programs through an API that supports read and
write operations. The read and write operations are carried
out on the Open Directory 130, which is the only structure
of the cache 80 that is "visible" to external programs or
processes. The read operation is invoked by an external
program that wants to locate an object in the cache. The
write operation is invoked by a program that wants to store
an object in the cache. Within the programs that make up the
cache 80, operations called lookup, remove, checkout, and
checkin are supported. The lookup operation looks up an
object in the Open Directory based upon a key. The remove
operation removes an object from the Open Directory based
upon a key. The checkout operation obtains a copy of a block
from the Directory Table 110 in an orderly manner so as to
ensure data consistency. The checkin operation returns a
copy of a block (which may have been modified in other
operations) to the Directory Table 110. In other
embodiments, a single cache lookup operation combines
aspects of these operations.

1. Lookup 

In an alternate embodiment, a LOOKUP operation is used
to determine whether a particular object identified by a
particular name is currently stored in the cache 80. FIG. 9A
is a flow diagram of steps carried out in one embodiment of
the LOOKUP operation, which is generally designated by
reference numeral 902. The LOOKUP operation is initiated
by a command from the protocol engine 70 to the cache 80
when a request message from a client 10a seeks to retrieve
a particular object from the server 40. The request message
from the client 10a identifies the requested object by its
name.

When the process is applied in the context of the World
Wide Web, the name is a Uniform Resource Locator (URL).
In step 904, the cache 80 converts the name of the object to
a key value. In the preferred embodiment, the conversion
step is carried out as shown in FIG. 3B. The object name 53
or URL is passed to a hash function, such as the MD5
one-way hash function. The output of the hash function is an
object name key 62. The object name key 62 can be broken
up into one or more subkey values 64, 66.
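
The conversion of step 904 can be sketched as follows. This
is an illustration only: the function names name_key and
subkey are hypothetical, and the choice of which bits of the
128-bit MD5 output form each subkey is an assumption, since
the text states only that the name key can be broken into
subkey values.

```python
import hashlib


def name_key(object_name: str) -> int:
    """Hash an object name (for example a URL) into a 128-bit name key."""
    digest = hashlib.md5(object_name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def subkey(key: int, offset: int, width: int) -> int:
    """Extract `width` bits of the key starting at bit `offset` (assumed layout)."""
    return (key >> offset) & ((1 << width) - 1)


key = name_key("http://www.example.com/index.html")
lookup_subkey = subkey(key, 0, 17)    # e.g. a 17-bit subkey used for lookup
other_subkey = subkey(key, 17, 18)    # e.g. an 18-bit subkey from the next bits
```

Because the hash is deterministic, the same URL always
yields the same name key and therefore the same subkeys.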

In step 906, the cache 80 looks up the request key value 
in the Open Directory 130. The Open Directory is consulted 
first because it is expected to store the most recently 
requested objects and therefore is likely to contain the object 
in the client request. Preferably, step 906 involves using one 
of the subkey values as a lookup key. For example, a 17-bit 
or 18-bit subkey value can be used for the lookup. 

In step 908, the cache 80 tests whether the subkey value 
has been found in the Open Directory. If the subkey value 
has been found in the Open Directory, then in step 910 the 
cache 80 retrieves the object from one of the storage devices, 
and delivers the object to the client. The retrieval sub-step 
involves the sub-steps described above in connection with 
locating objects in pools, arenas, and fragments of non- 
volatile storage in the storage devices 90a-90c. The delivery 
sub-step involves constructing an HTTP response to the
client that includes data of the object, opening an HTTP
connection to the client, and sending the HTTP response to
the client.

If the subkey value is not found in the Open Directory, 
then in step 912, the cache 80 looks up the request subkey 
value in the Tag Table 102. In step 914, the cache 80 tests 
whether the subkey value was found in the Tag Table 102. 
If no match was found, then in step 916 the cache 80 stores 
information about the fact that no match occurred, for later 
use as described below. The information can be a bit 
indicating that a miss in the Tag Table 102 occurred. 

In step 918, the cache 80 looks up the subkey value in the 
Directory Table. If the test of step 914 was affirmative, then 
the cache 80 retrieves a subkey value matching the request 
subkey value from one of the entries 106 of the Tag Table
102. Its value is used as a key to look up the request key 
value in the Directory Table. In step 920, the cache 80 tests 
whether the request key value was found in the Directory 
Table. If a hit occurs, and there was a miss in the Tag Table 
as indicated by the information stored in step 916, then in 
step 922 the cache 80 updates the Open Directory with 
information related to the Directory Table hit. Control is 
then passed to step 910 in which the object is obtained and 
delivered to the client in the manner described above. 

If the test of step 920 is negative, then the requested object 
is not in the cache, and a cache miss condition occurs, as 
indicated in step 924. In response to the miss condition, in 
step 926 the cache 80 obtains a copy of the requested object 
from the server that is its source. For example, in the Web 
context, the cache 80 opens an HTTP connection to the URL 
provided in the client's request, and downloads the object. 
The object is then provided to the client and stored in the 
cache for future reference. 
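
The order of tests in steps 906 through 926 can be
summarized by the following simplified Python sketch. It is
an assumption-laden model, not the disclosed structure: the
Open Directory and Directory Table are modeled as plain
dictionaries, the Tag Table as a set, and the function name
lookup, the fetch_from_origin callback, and the choice to
record a fetched object in the Tag Table and Directory Table
are all hypothetical.

```python
from typing import Callable, Dict, Set, Tuple


def lookup(
    key: str,
    open_directory: Dict[str, bytes],
    tag_table: Set[str],
    directory_table: Dict[str, bytes],
    fetch_from_origin: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return the object and the name of the structure that satisfied the request."""
    # Steps 906-910: consult the Open Directory first.
    if key in open_directory:
        return open_directory[key], "open_directory"
    # Steps 912-916: consult the Tag Table and remember whether it missed.
    tag_miss = key not in tag_table
    # Steps 918-920: consult the Directory Table.
    if key in directory_table:
        obj = directory_table[key]
        if tag_miss:
            open_directory[key] = obj      # step 922: update the Open Directory
        return obj, "directory_table"
    # Steps 924-926: cache miss, obtain the object from its source.
    obj = fetch_from_origin(key)
    tag_table.add(key)                     # assumption: how the object is stored
    directory_table[key] = obj
    return obj, "origin"
```

A first request for an unknown key is served from the origin
server; a repeated request for the same key is then served
from the modeled Directory Table without contacting the
origin again.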

In a preferred embodiment, the LOOKUP operation is
implemented as a method of an object in an object-oriented
programming language that receives a key value as a
parameter.

2. Cache Open Read Process 

FIG. 9E is a flow diagram of a preferred process of 
reading an object that is identified by an object name (such 
as a URL) from the cache. In the preferred embodiment, the 
process of FIG. 9E is called "open_read," and represents the 
sole external interface of the cache 80. It is advantageous, to 
ensure control and consistency of data in the cache, to enable 
external programs to access only operations that use or 
modify the Open Directory 130. Preferably, the process of
FIG. 9E is implemented as a program or programmatic
object that receives an object name, and information about
the user's particular request, as input parameters. The read
process returns a copy of an object associated with a key that
is found in the cache using the lookup process. Thus, the
read process, and other processes that are invoked or called
by it, are an alternative to the LOOKUP operation described
above in connection with FIG. 9A.

In step 964, the process checks out a Vector of Alternates
so that alternates in the vector can be read. Preferably, step
964 involves invoking the checkout_read process described
herein in connection with FIG. 8D, providing a key derived
from the object name as a parameter. Checking out a vector
involves checking out a block from the Open Directory that
has a pointer to the vector, and reaching the block from the
cache.

If the checkout operation is successful, then in step 966
the process uses the request information to select one of the
alternates from among the alternates in the vector. This
selection is carried out in the manner described above in
connection with the Vector of Alternates 122. In an
embodiment, the selection operation is carried out by
another program or programmatic object that returns a
success/failure indication depending upon whether a
suitable alternate is located. If the selection is successful, then
in step 968 the process checks the Vector of Alternates back
in. In step 970, the process reads the object that is pointed
to by the selected alternate.

If step 964 or step 966 results in failure, then the requested
document does not exist in the cache. Accordingly, in step
972 the process returns a "no document" error message to
the calling program or process.

3. Cache Open Write Process

FIG. 9F is a flow diagram of a process of writing an object
into the cache. As in the case of the read process described
above in connection with FIG. 9E, the write process
preferably is implemented as an "open_write" method that is
the sole interface of the cache 80 to external programs
needing to store objects in the cache. Preferably, the process
of FIG. 9F is implemented as a program or method that
receives an object name, request information, and response
information as input parameters. The object name identifies
an object to be written into the cache; in the preferred
embodiment, the object name is a name key 62 derived from
a URL using the mechanism shown in FIG. 3B.

The write process is initiated when a client 10a has
requested an object 52 from the cache 80 that is not found
in the cache. As a result, the cache 80 opens an HTTP
transaction with the server 40 that stores the object, and
obtains a copy of the object from it. The request information
that is provided to the cache write process is derived from
the HTTP request that came from the client. The response
information is derived from the response of the server 40 to
the cache 80 that supplies the copy of the object.

In step 974, the process checks out a Vector of Alternates.
This step involves computing a key value based upon the
object name, looking up a set and a block in the Open
Directory that map to the key value, and locating a Vector of
Alternates, if any, that corresponds to the block. If no vector
exists, as shown in step 984, a new vector is created.

If a vector is successfully checked out or created, then in
step 976 the process uses the request information to define
a new alternate record 123a-123n within the current
alternate. The new alternate record references the location of the
object, and contains a copy of the request information and
the response information. The new alternate is added to the
Vector of Alternates. Duplicate alternate records are
permitted; the Vector of Alternates can contain more than one
alternate record that contains the same request and response
information. Testing existing alternate records to identify
duplicates is considered unnecessary because only a small
incremental amount of storage is occupied by duplicate
alternate records.

In step 978, the modified vector is checked into the cache
using the steps described above. In step 980, the object is
written to one of the data storage devices 90a-90c in the
[...] found to be in use during step 980, then the write
[...]. This avoids overwriting an object identified by a key
that is being updated.

4. Cache Update Process

FIG. 9G is a flow diagram of a cache update process. The
update process is used to modify a Vector of Alternates to
store different request information or response information.
Generally, the update process is invoked by the protocol
engine 70 when the cache 80 is currently storing an object
52 that matches a request from a client 10a, but the protocol
engine determines that the object has expired or is no longer
valid. Under these circumstances, the protocol engine 70
opens an HTTP transaction to the server 40 that provided the
original object 52, and sends a message that asks the server
whether the object has changed on the server. This process
is called "revalidation" of the object 52. If the server 40
responds in the negative, the server will provide a short
HTTP message with a header indicating that no change has
occurred, and providing new response information. In that
case, the protocol engine 70 invokes the cache update
process in order to move the new response information
about the object 52 into the cache 80.

If the server 40 responds affirmatively that the object 52
has changed since its expiration date or time in the cache 80,
then the update process is not invoked. Instead, the server 40
returns a copy of the updated object 52 along with a new
expiration date and other response information. In that case,
the protocol engine 70 invokes the cache write process and
the create processes described above to add the new object
52 to the cache 80.

As shown in FIG. 9G, the update process receives input
parameters including an object name, an "old" identifier,
request information, and response information. The object
name is a URL or a key derived from a URL. The request
information and response information are derived from the
client's HTTP request for the object 52 from the cache 80,
and from the response of the server 40 when the cache
obtains an updated copy of the object from the cache.

The "old" identifier is a value that uniquely identifies a
pair of request information and response information. In the
preferred embodiment, when a cache miss causes the cache
80 to write a new object into the cache, information from the
client request is paired with response information from the
server that provides a copy of the object. Each pair is given
a unique identifier value.

In step 986, the process checks out a Vector of Alternates
corresponding to the object name from the cache. Preferably,
this is accomplished by invoking the checkout_write
process described herein. This involves using the object name
or URL to look up an object in the Open Directory, the Tag
Table, and the Directory Index, so that a corresponding
Vector of Alternates is obtained. If the checkout step fails,
then in step 996 the process returns an appropriate error
message.

If the checkout is successful, then in step 988 a copy or
clone of the vector is created in main memory. A request/
response identifier value is located within the vector by
matching it to the old identifier value received as input to
the process. The old identifier value is removed and a new
identifier is written in its place. The new identifier uniquely
identifies the new request and response information that is
provided to the process as input.

In step 990, the new vector is written to one of the storage
devices 90a-90c, and in step 992 the new vector is checked
in to the cache. In carrying out these steps, it is desirable to
completely write the clone vector to the storage device
before the vector is checked in. This ensures that the writing
operation is successful before the directory tables are
modified to reference the clone vector. It also ensures that the old
vector is available to any process or program that needs to
access it.

5. Directory Lookup

FIG. 9C is a flow diagram of a preferred embodiment of
a process of looking up information in the Open Directory
130. The process of FIG. 9C is implemented as a program
process or method that receives a subkey portion of a name
key 62 as an input parameter. In preceding steps that are not
shown, it will be understood that the protocol engine 70
receives an object name, such as a URL. For example, a
URL is provided in an HTTP request issued by a client to a
server that is operating the cache. The protocol engine 70
applies a hash function to the object name. The hash function
yields, as its result or output, a name key that identifies a set
in the cache.

In step 948, the process attempts to check out one or more
blocks that are identified by the subkey from the Directory
Index. The block checkout step preferably involves invoking
the checkout_read process described herein.

If the checkout attempt results in a failure state, then in
step 950 the process returns an error message to the program
or process that called it, indicating that a block matching the
input subkey was not found in the cache. Control is passed
to step 952 in which the process concludes.

If the checkout attempt is successful, then a copy of a
block becomes available for use by the calling program. In
step 954, the block that was checked out is checked in again.
In step 956, the process returns a message to the calling
program indicating that the requested block was found.
Processing concludes at step 952.

Thus, a cache search operation involves calling more
primitive processes that seek to check out a block identified
by a key from the Open Directory. If the primitives do not
find the block in the Open Directory, the Directory Index is
searched.

When a block is found, it is delivered to the client. For
example, when the invention is applied to the World Wide
Web context, the data block is delivered by opening an
HTTP connection to the client and transmitting the data
block to the client using an HTTP transaction. This step may
involve buffering several data blocks before the transaction
is opened.

6. Cache Remove Process

FIG. 9D is a flow diagram of a process of removing a
block relating to an object from the cache. As in the case of
the checkout operations, the cache remove process receives
a key value as input. The process comprises steps 958 to
962. These steps carry out operations that are substantially
similar to the operations of steps 948, 954, and 952 of FIG.
9C. To accomplish removal of a block found in the cache,
however, in step 960 the process sets the deletion flag, and
checks the block in with the deletion flag set. As described
herein in connection with the check-in process (steps 938
and 944 of FIG. 9B), when the deletion flag is set, the block
will be marked as deleted. Thereafter, the block is eventually
removed from the Directory Index when the changes
reflected in the Open Directory are synchronized to the
Directory Index.

7. Checkout Read Operation

FIG. 8D is a flow diagram of a checkout_read operation
that is used in connection with the Directory Table 110. The
checkout_read operation is used to obtain a copy of a block
from the Directory Table 110 that matches a particular key.
Once the block is checked out from the Directory Table, it
can be read and used by the process that checked
it out, but by no other process. Thereafter, to make the block
available to other processes, the block is checked back in.
Complementary checkout and check-in processes are used in
order to ensure that only one process at a time can modify
a Directory Table block, a mechanism that is essential to
ensure that the Directory Table always stores accurate
information about objects in the cache. Thus, it will be apparent
that the checkout and check-in processes are primitive
processes that assist in searching the cache for a particular
object.

As indicated in FIG. 8D, the checkout_read operation
receives a key value as input. In the preferred embodiment,
the input key value is a subkey portion of a name key 62 that
corresponds to an object name.

Because the object store will be modifying portions of
memory and disk data structures, it needs to guarantee a
brief period of mutual exclusion to a subset of the cache data
structures in order to achieve consistent results. The cache
data structures are partitioned into 256 virtual "slices",
selected by 8 bits of the key. Each slice has an associated
mutex lock. In step 832, the process seeks to obtain the lock
for the input key. If a lock cannot be obtained, the process
waits until it becomes available. A lock can be
unavailable if another transaction is modifying the small
amount of memory state associated with a key that falls into
the same slice.

When a lock is obtained, the input key becomes
unavailable for use by other processes. In step 834, the process
determines which set 110a-110n of the Directory Table 110
corresponds to the key. The process then locates one of the
block lists 132a, 132b of the Open Directory 130 that
corresponds to the set of the Directory Table 110, by
associating the value of a subkey of the input key with one
of the block lists. In step 836, the process scans the blocks
in the selected block list of the Open Directory 130, seeking
a match of the input key to a key stored in one of the blocks.

If a match is found, then in step 838 the process tests
whether the matching block is currently in the process of
being created or destroyed by another process. If the
matching block is currently in the process of being created or
destroyed, then in step 840 an error message is returned to
the protocol engine 70 indicating that the current block is not
available.

On the other hand, if the matching block is not currently
in the process of being created or destroyed, then the block
can be used. Accordingly, in step 842 the process increments
a read counter. The read counter is an internal variable,
associated with the block, that indicates the number of
processes or instances of programmatic objects that are
reading the block. Such processes or objects are called
"readers." In step 844, the process obtains a copy of the
block, and returns it to the calling program or process.

If a match is not found in the scan of step 836, then in step
846, the process invokes a search of the Directory Table,
seeking a match of the key to a set and block of the Directory
Table using a process that is described further herein. If no
match of the key is found in the search, then in step 848 the
process returns an error message to the calling program or
process, indicating that the requested object does not exist in
the cache. Although the specific response to such a message
is determined by the calling program or process, in the
World Wide Web context, generally the proxy 30 contacts
the server 40 that stores the object using an HTTP request,
and obtains a copy of the requested object.

If a match is found during the Directory Index lookup of
step 846, then in step 850 a corresponding block is added to
the Open Directory. This is carried out by creating a new
Open Directory block in main memory; initializing the block
by copying information from the corresponding Directory
Index block; and adding a reference to the new block to the
corresponding list of blocks 132a, 132b.
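
The checkout_read steps 832 through 850 can be illustrated
by the following Python sketch. It is a simplified model
under stated assumptions, not the disclosed implementation:
the Open Directory and Directory Table are modeled as
dictionaries keyed by an integer key, the slice is assumed
to be selected by the low 8 bits of the key, the class and
exception names are hypothetical, and a block newly added in
step 850 is assumed to proceed to the reader-count step 842
by analogy with the checkout_write process.

```python
import copy
import threading
from dataclasses import dataclass
from typing import Dict

NUM_SLICES = 256  # 256 virtual slices, each guarded by its own mutex
SLICE_LOCKS = [threading.Lock() for _ in range(NUM_SLICES)]


class BlockUnavailable(Exception):
    """The block is being created or destroyed (step 840)."""


class NotInCache(Exception):
    """No matching block exists in the cache (step 848)."""


@dataclass
class Block:
    key: int
    data: bytes
    readers: int = 0
    in_transition: bool = False  # being created or destroyed


def slice_of(key: int) -> int:
    return key & 0xFF  # assumption: the low 8 bits select the slice


def checkout_read(key: int, open_directory: Dict[int, Block],
                  directory_table: Dict[int, bytes]) -> Block:
    with SLICE_LOCKS[slice_of(key)]:              # step 832: wait for the slice lock
        block = open_directory.get(key)           # steps 834-836: scan for the key
        if block is None:
            data = directory_table.get(key)       # step 846: search the Directory Table
            if data is None:
                raise NotInCache(key)             # step 848
            block = Block(key=key, data=data)     # step 850: add to the Open Directory
            open_directory[key] = block
        elif block.in_transition:                 # step 838
            raise BlockUnavailable(key)           # step 840
        block.readers += 1                        # step 842: one more reader
        return copy.copy(block)                   # step 844: return a copy
```

Two keys that differ only in bits above the low 8 contend
for the same slice lock in this model, which is the "same
slice" condition described above.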

8. Checkout Write Operation 

FIG. 8E is a flow diagram of a checkout_write process or
operation that is used in connection with the Open Directory
130. The checkout_write operation is used to obtain a copy
of a block from the Open Directory 130 that matches a key
that is passed to the process, for the purpose of modifying or
updating the contents of the block, or an object or vector that
is associated with the block. Once a block is checked out of
the Open Directory 130 using checkout_write, other
processes can modify the block or its associated object or
vector. The block is then checked back in using the checkin
process described herein. Using these operations, changes
are stored in the Open Directory and then propagated to the
Directory Table in an orderly manner.

As indicated in FIG. 8E, the checkout_write process
receives a key value as input. In the preferred embodiment,
the input key value is a subkey portion of a name key 62 that
corresponds to an object name. In step 854, the process seeks
to obtain a lock on the designated key. If a lock cannot be
obtained, the process waits until one is available.

When a lock is obtained, the key becomes unavailable for
use by other processes. In step 856, the process determines
which set 110a-110n of the Directory Table 110 corresponds
to the key. The process then locates one of the block lists
132a, 132b of the Open Directory 130 that corresponds to
the set of the Directory Table 110. In step 858, the process
scans the blocks in the selected block list of the Open
Directory 130, seeking a match of the input key to a key
stored in one of the blocks.

If a match is found, then in step 864 the process tests
whether the matching block is currently in the process of
being created or destroyed by another process. If so, then in
step 866 an error message is returned to the protocol engine
70 or cache 80 indicating that the current block is not
available. If the matching block is not currently in the
process of being created or destroyed, then the block can be
used. Accordingly, in step 868 the process increments a write
counter. The write counter is an internal variable, stored in
association with the block, that indicates the number of
processes or programmatic objects that are writing the
block. In step 870, the process obtains a copy of the block,
returns it to the calling program or process, and also marks
the copy as being modified. The marking ensures that any
changes made to the block will be reflected in the Directory
Index when the Open Directory is synchronized to the
Directory Index.

If a match is not found in the scan of step 858, then in step
860, the process invokes a search of the Directory Index
using a process that is described further herein. If no match
is found in the search, then in step 862 the process returns
an error message to the calling program or process,
indicating that the requested object does not exist in the cache. In
the World Wide Web context, typically the calling program
would contact the originating server that stores the object
using an HTTP request, and obtain a copy of the requested
object.

If a match is found during the Directory Index lookup of
step 860, then in step 874 a corresponding block is added to
the Open Directory. This is carried out by creating a new
Open Directory block in main memory; initializing the block
by copying information from the corresponding Directory
Index block; and adding a reference to the new block to the
corresponding list of blocks 132a, 132b. Control is then
passed to step 868, in which the write count is incremented
and the process continues as described above in connection
with steps 868-870.

9. Checkout Create Operation 

FIG. 8F is a flow diagram of a checkout_create operation
that is supported for use in connection with the Open
Directory 130. The checkout_create operation is used to
create a new block in the Open Directory 130 for a name key
that corresponds to a new object that is being added to the
cache. Once the block is created in the Open Directory 130,
the object can be obtained by users from the cache through
the Open Directory 130.

As indicated in FIG. 8F, the checkout_create process 
receives a key value as input. In the preferred embodiment, 
the input key value is a subkey portion of a name key 62 that 
corresponds to an object name. In step 876, the process seeks 
to obtain a lock on the designated key. If a lock cannot be 
obtained, the process waits until one is available. 

When a lock is obtained, the key becomes unavailable for
use by other processes. In step 878, the process determines
which set 110a-110n of the Directory Table 110 corresponds
to the key. The process then locates the set of the Open
Directory 130 that corresponds to the set of the Directory
Table 110, using the set subkey bits of the input key. In step
880, the process scans the blocks in the selected block list of
the Open Directory 130, seeking a match of the input key to
a key stored in one of the blocks.

If a match is found, then an attempt is being made to
create a block that already exists. Accordingly, in step 882
the process tests whether the matching block has been
marked as deleted, and currently has no other processes
reading it or writing it. If the values of both the reader
counter and the writer counter are zero, then the block has
no other processes reading it or writing it. If the value of
either the reader counter or the writer counter is nonzero,
or if the matching block has not been marked as deleted, then
the block is a valid previously existing block that cannot be
created. In step 884 an error message is returned to the
protocol engine 70 or cache 80 indicating that the current
block is not available to be created.

If the matching block is deleted and has no writers or
readers accessing it, then the process can effectively create
a new block by clearing and initializing the matching,
previously created block. Accordingly, in step 886 the
process clears the matching block. In step 888 the process
initializes the cleared block by zeroing out particular fields
and setting the block's key value to the key. In step 890, the
process increments the writer counter associated with the
block, and marks the block as created. In step 892, the
process returns a copy of the block to the calling process or
programmatic object, and marks the block as being
modified.



If a match is not found in the scan of step 880, then no
matching block currently exists in the Open Directory 130.
In step 894, the process carries out a search of the Directory
Index using a process that is described further herein. If a
match occurs, then in step 896, the process returns an error
message to the calling program or process, indicating that
the block to be created already exists in the cache and cannot
be deleted.

If no match is found in the search, then no matching block
currently exists in the entire cache. In step 898, the process
creates a new Open Directory block, and adds a reference to
that block to the list 132a, 132b associated with the set value
computed in step 878. Control is passed to step 890, in
which the processing continues as described above in
connection with steps 890-892.
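
As a rough illustration only, creation steps 880 through 898 might be sketched in Python as follows. The names `Block`, `BlockUnavailableError`, and `create_block` are hypothetical. The Open Directory is modeled as a dictionary from set value to a list of blocks, the Directory Index search of step 894 is reduced to a membership test, and locking is omitted. None of these simplifications are taken from the disclosure.

```python
import copy
from dataclasses import dataclass, field


class BlockUnavailableError(Exception):
    """Hypothetical error for steps 884 and 896: the block cannot be created."""


@dataclass
class Block:
    # Field names are assumptions made for this sketch.
    key: int
    readers: int = 0
    writers: int = 0
    deleted: bool = False
    created: bool = False
    modified: bool = False
    data: dict = field(default_factory=dict)


def create_block(open_dir, directory_index, key, set_value):
    """Create a block for `key` and return a copy of it.

    open_dir: dict mapping a set value to a list of Block objects.
    directory_index: any container supporting `key in directory_index`.
    set_value: the set value assumed to have been computed in step 878.
    """
    blocks = open_dir.setdefault(set_value, [])
    # Step 880: scan the blocks of the set for a matching key.
    match = next((b for b in blocks if b.key == key), None)
    if match is not None:
        # Step 882: reusable only if deleted and idle (no readers, no writers).
        if not (match.deleted and match.readers == 0 and match.writers == 0):
            # Step 884: a valid block already exists.
            raise BlockUnavailableError(f"block {key!r} is not available")
        match.data.clear()          # step 886: clear the matching block
        match.deleted = False       # step 888: reinitialize fields and key
        match.created = False
        match.modified = False
        match.key = key
        block = match
    elif key in directory_index:    # step 894: search the Directory Index
        # Step 896: the block already exists in the cache.
        raise BlockUnavailableError(f"block {key!r} already exists in the cache")
    else:
        block = Block(key=key)      # step 898: new Open Directory block
        blocks.append(block)
    block.writers += 1              # step 890: one more writer, mark created
    block.created = True
    block.modified = True           # step 892: mark modified, return a copy
    return copy.deepcopy(block)
```

Returning a deep copy mirrors the statement that the caller works on a copy of the block, so the Open Directory entry changes only when the block is checked in.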

10. Checkin Process 

FIG. 9B is a flow diagram of a block check-in process. 
The cache 80 carries out the process of FIG. 9B to check a 
block into the Open Directory 130 after the block is read, 
modified, or deleted. In an embodiment, the process of FIG. 
9B is implemented as a program process or object that 
receives an identifier of a block as a parameter. Because the 
key is present in the checked out block, we do not need to 
pass in the key as an argument.

In step 930, the process attempts to get a lock for the key 
associated with the block. If no lock is available, then the 
process enters a wait loop until a lock is available. When a 
lock is available, in step 932 the process tests whether the 
block is being checked in after the block has been modified.
If so, then in step 934 the writer count for the block is 
decremented, indicating that a process has completed writ- 
ing the block. 

In step 936, the process tests whether the check-in process
has been carried out successfully. If this test is affirmative,
then in step 942 the process copies the information in the
current block to the corresponding original block in the
Open Directory. In this way, the Open Directory is updated
with any changes that were carried out by the process that
modified the copy of the block that was obtained in the
checkout process. Thereafter, and if the test of step 936 is
negative, the process tests whether a delete check-in flag is
set. The delete check-in flag indicates that the block is to be
deleted after check-in. The delete flag is an argument to the
check-in operation. If the flag is set, then in step 944 the
process marks the block as deleted. Processing concludes at
step 940.

If the test of step 932 is negative, then the block is not 
being modified. As a result, the only other possible state is 
that the block has been read. Accordingly, in step 946, the
reader count is decremented. 
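
The check-in flow of FIG. 9B might be sketched as follows, again only as an illustration. The names `Block` and `check_in`, the parameters `was_modified`, `succeeded`, and `delete`, and the per-key `threading.Lock` table are assumptions of this sketch. The Open Directory is simplified to a dictionary keyed by block key. The delete flag is tested only on the modified path, because that is the only path on which the text describes the test. Creating the lock with `setdefault` is not itself race-free and stands in for whatever lock manager an implementation would use.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Block:
    # Field names are assumptions made for this sketch.
    key: int
    readers: int = 0
    writers: int = 0
    deleted: bool = False
    data: dict = field(default_factory=dict)


def check_in(open_dir, locks, block_copy, was_modified,
             succeeded=True, delete=False):
    """Check a previously checked-out copy back into the Open Directory.

    open_dir: dict mapping a key to the original Block.
    locks: dict mapping a key to a threading.Lock (hypothetical lock table).
    The key is read from the checked-out copy, so it is not a parameter.
    """
    lock = locks.setdefault(block_copy.key, threading.Lock())
    with lock:                                   # step 930: wait for the lock
        original = open_dir[block_copy.key]
        if was_modified:                         # step 932
            original.writers -= 1                # step 934: writer is done
            if succeeded:                        # step 936
                # Step 942: copy the changes back to the original block.
                original.data = dict(block_copy.data)
            if delete:
                original.deleted = True          # step 944: mark as deleted
        else:
            original.readers -= 1                # step 946: reader is done
        return original                          # step 940: done
```

A block marked as deleted here with both counts at zero is exactly the kind of block that the creation process is permitted to clear and reuse.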

Implementation of Methods 

In the preferred embodiment, the methods described
herein are carried out using a general-purpose programmable
digital computer system of the type illustrated in FIG.
11. Each of the methods can be implemented in several
different ways. For example, the methods can be implemented
in the form of procedural computer programs,
object-oriented programs, processes, applets, etc., in either a
single-process or multi-threaded, multi-processing system.

In a preferred embodiment, each of the processes is 
independent and re-entrant, so that each process can be 
instantiated multiple times when the cache is in operation. 
For example, the garbage collection process runs concurrently
with and independent of the allocation and writing
processes. 

Hardware Overview

FIG. 11 is a block diagram that illustrates a computer 
system 1100 upon which an embodiment of the invention 
may be implemented. Computer system 1100 includes a bus 
1102 or other communication mechanism for communicat- 
ing information, and a processor 1104 coupled with bus 1102 
for processing information. Computer system 1100 also 
includes a main memory 1106, such as a random access 
memory (RAM) or other dynamic storage device, coupled to 
bus 1102 for storing information and instructions to be 
executed by processor 1104. Main memory 1106 also may 
be used for storing temporary variables or other intermediate 
information during execution of instructions to be executed 
by processor 1104. Computer system 1100 further includes 
a read only memory (ROM) 1108 or other static storage 
device coupled to bus 1102 for storing static information and 
instructions for processor 1104. A storage device 1110, such 
as a magnetic disk or optical disk, is provided and coupled 
to bus 1102 for storing information and instructions. 

Computer system 1100 may be coupled via bus 1102 to a 
display 1112, such as a cathode ray tube (CRT), for display- 
ing information to a computer user. An input device 1114, 
including alphanumeric and other keys, is coupled to bus 
1102 for communicating information and command selec- 
tions to processor 1104. Another type of user input device is 
cursor control 1116, such as a mouse, a trackball, or cursor 
direction keys for communicating direction information and 
command selections to processor 1104 and for controlling 
cursor movement on display 1112. This input device typi- 
cally has two degrees of freedom in two axes, a first axis 
(e.g., x) and a second axis (e.g., y), that allows the device to 
specify positions in a plane. 

The invention is related to the use of computer system 
1100 for caching information objects. According to one 
embodiment of the invention, caching information objects is 
provided by computer system 1100 in response to processor 
1104 executing one or more sequences of one or more 
instructions contained in main memory 1106. Such instruc- 
tions may be read into main memory 1106 from another 
computer-readable medium, such as storage device 1110. 
Execution of the sequences of instructions contained in main 
memory 1106 causes processor 1104 to perform the process 
steps described herein. In alternative embodiments, hard- 
wired circuitry may be used in place of or in combination 
with software instructions to implement the invention. Thus, 
embodiments of the invention are not limited to any specific 
combination of hardware circuitry and software. 

The term "computer-readable medium" as used herein 
refers to any medium that participates in providing instructions
to processor 1104 for execution. Such a medium may
take many forms, including but not limited to, non-volatile 
media, volatile media, and transmission media. Non-volatile 
media includes, for example, optical or magnetic disks, such 
as storage device 1110. Volatile media includes dynamic 
memory, such as main memory 1106. Transmission media 
includes coaxial cables, copper wire and fiber optics, includ- 
ing the wires that comprise bus 1102. Transmission media 
can also take the form of acoustic or light waves, such as 
those generated during radio-wave and infra-red data com- 
munications. 

Common forms of computer-readable media include, for 
example, a floppy disk, a flexible disk, hard disk, magnetic 
tape, or any other magnetic medium, a CD-ROM, any other 
optical medium, punch cards, paper tape, any other physical 
medium with patterns of holes, a RAM, a PROM, an
EPROM, a FLASH-EPROM, any other memory chip or 
cartridge, a carrier wave as described hereinafter, or any
other medium from which a computer can read.

Various forms of computer readable media may be
involved in carrying one or more sequences of one or more
instructions to processor 1104 for execution. For example,
the instructions may initially be carried on a magnetic disk
of a remote computer. The remote computer can load the
instructions into its dynamic memory and send the instructions
over a telephone line using a modem. A modem local
to computer system 1100 can receive the data on the
telephone line and use an infrared transmitter to convert the
data to an infrared signal. An infrared detector coupled to
bus 1102 can receive the data carried in the infrared signal
and place the data on bus 1102. Bus 1102 carries the data to
main memory 1106, from which processor 1104 retrieves
and executes the instructions. The instructions received by
main memory 1106 may optionally be stored on storage
device 1110 either before or after execution by processor
1104.

Computer system 1100 also includes a communication
interface 1118 coupled to bus 1102. Communication interface
1118 provides a two-way data communication coupling
to a network link 1120 that is connected to a local network
1122. For example, communication interface 1118 may be
an integrated services digital network (ISDN) card or a
modem to provide a data communication connection to a
corresponding type of telephone line. As another example,
communication interface 1118 may be a local area network
(LAN) card to provide a data communication connection to
a compatible LAN. Wireless links may also be implemented.
In any such implementation, communication interface 1118
sends and receives electrical, electromagnetic or optical
signals that carry digital data streams representing various
types of information.

Network link 1120 typically provides data communication
through one or more networks to other data devices. For
example, network link 1120 may provide a connection
through local network 1122 to a host computer 1124 or to
data equipment operated by an Internet Service Provider
(ISP) 1126. ISP 1126 in turn provides data communication
services through the world wide packet data communication
network now commonly referred to as the "Internet" 1128.
Local network 1122 and Internet 1128 both use electrical,
electromagnetic or optical signals that carry digital data
streams. The signals through the various networks and the
signals on network link 1120 and through communication
interface 1118, which carry the digital data to and from
computer system 1100, are exemplary forms of carrier
waves transporting the information.

Computer system 1100 can send messages and receive
data, including program code, through the network(s), network
link 1120 and communication interface 1118. In the
Internet example, a server 1130 might transmit a requested
code for an application program through Internet 1128, ISP
1126, local network 1122 and communication interface
1118. In accordance with the invention, one such downloaded
application provides for caching information objects
as described herein.

The received code may be executed by processor 1104 as
it is received, and/or stored in storage device 1110, or other
non-volatile storage for later execution. In this manner,
computer system 1100 may obtain application code in the
form of a carrier wave.

Accordingly, an object cache has been described having
distinct advantages over prior approaches. In particular, this
document describes an object cache that offers high
performance, as measured by low latency and high throughput
for object store operations, and large numbers of concurrent
operations. The mechanisms described herein are
applicable to a large object cache that stores terabytes of
information, and billions of objects, commensurate with the
growth rate.

The object cache takes advantage of memory storage
space efficiency, so expensive semiconductor memory is
used sparingly and effectively. The cache also offers disk
storage space efficiency, so that large numbers of Internet
object replicas can be stored within the finite disk capacity
of the object store. The cache is alias free, so that multiple
objects or object variants, with different names, but with
identical object content, will have the object
content cached only once, shared among the different names.

The cache described herein has support for multimedia
heterogeneity, efficiently supporting diverse multimedia
objects of a multitude of types with size ranging over six
orders of magnitude from a few hundred bytes to hundreds
of megabytes. The cache has fast, usage-aware garbage
collection, so less useful objects can be efficiently removed
from the object store to make room for new objects. The
cache features data consistency, so programmatic errors and
hardware failures do not lead to corrupted data.

The cache has fast restartability, so an object cache can
begin servicing requests within seconds of restart, without
requiring a time-consuming database or file system check
operation. The cache uses streaming I/O, so large objects can
be efficiently pipelined from the object store to slow clients,
without staging the entire object into memory. The cache has
support for content negotiation, so proxy caches can efficiently
and flexibly store variants of objects for the same
URL, targeted on client browser, language, or other attribute
of the client request. The cache is general purpose, so that
the object store interface is sufficiently flexible to meet the
needs of future media types and protocols.

The foregoing advantages and properties should be
regarded as features of the technical description in this
document; however, such advantages and properties do not
necessarily form a part of the invention, nor are they
required by any particular claim that follows this description.

In the foregoing specification, the invention has been
described with reference to specific embodiments thereof
and with reference to particular goals and advantages. It
will, however, be evident that various modifications and
changes may be made thereto without departing from the
broader spirit and scope of the invention. The specification
and drawings are, accordingly, to be regarded in an illustrative
rather than a restrictive sense.

What is claimed is:

1. A method of managing a plurality of counters stored in
a computer memory, comprising the steps of:

computing a sum by summing values actually stored in
each of the plurality of counters;

determining the maximum value that can be stored in each
of the plurality of counters;

determining a maximum sum value by summing the
maximum values that can be stored in each of the
plurality of counters;

determining a threshold value based on the maximum sum
value; and

decrementing each of the counters when the sum is
greater than the threshold value that is based on a
maximum sum value.

2. The method recited in claim 1, further comprising the
steps of:

storing each of the plurality of counters in association
with a description of one of a plurality of information
objects stored in a cache; and

deleting one of the information objects that is associated
with one of the counters when the counter is decremented
to zero.

3. The method of claim 1, wherein the threshold value is
a particular percentage of the maximum sum value.

4. The method of claim 3, wherein the threshold value is 
equal to one half of the maximum sum value. 

5. The method of claim 1, wherein: 

each counter of said plurality of counters is associated
with a corresponding information object in a cache; and 

the steps further include incrementing a particular counter 
of said plurality of counters when the corresponding 
information object is accessed. 

6. The method of claim 5, wherein:

each counter of said plurality of counters is associated
with a corresponding information object in a cache;

the plurality of counters includes a first counter associated
with a corresponding first information object;

the step of decrementing each counter includes decrementing
said first counter to a threshold counter value;
and

the steps further include deleting the first information
object when the first counter has been decremented to
the threshold counter value.

7. The method of claim 6, wherein the threshold counter 
value is zero. 

8. The method of claim 1, further including identifying a set
of information objects as least recently used based on the
values of said plurality of counters.

9. A computer-readable medium carrying one or more
sequences of one or more instructions for managing a
plurality of counters stored in a computer memory, the one
or more sequences of one or more instructions including
instructions which, when executed by one or more
processors, cause the one or more processors to perform the
steps of:

computing a sum by summing values actually stored in
each of the plurality of counters;

determining the maximum value that can be stored in each
of the plurality of counters;

determining a maximum sum value by summing the
maximum values that can be stored in each of the
plurality of counters;

determining a threshold value based on the maximum sum
value; and

decrementing each of the counters when the sum is
greater than the threshold value that is based on a
maximum sum value.

10. The computer-readable medium recited in claim 9, 
further comprising sequences of instructions for performing 
the steps of: 

storing each of the plurality of counters in association 
with a description of one of a plurality of information 
objects stored in a cache; and 

deleting one of the information objects that is associated 
with one of the counters when the counter is decre- 
mented to zero. 

11. The computer-readable media of claim 9, wherein the 
threshold value is a particular percentage of the maximum 
sum value. 

12. The computer-readable media of claim 11, wherein the 
threshold value is equal to one half of the maximum sum 
value. 

13. The computer-readable media of claim 9, wherein:

each counter of said plurality of counters is associated
with a corresponding information object in a cache; and

the steps further include incrementing a particular counter
of said plurality of counters when the corresponding
information object is accessed.

14. The computer-readable media of claim 13, wherein: 

each counter of said plurality of counters is associated 
with a corresponding information object in a cache; 

the plurality of counters includes a first counter associated 
with a corresponding first information object; 

the step of decrementing each counter includes decre- 
menting said first counter to a threshold counter value; 
and 

the steps further include deleting the first information 
object when the first counter associated with the infor- 
mation object has been decremented to the threshold 
counter value. 

15. The computer-readable media of claim 14, wherein 
the threshold counter value is zero. 

16. The computer-readable media of claim 9, wherein the 
steps further include identifying a set of information objects 
as least recently used based on the values of said plurality of 
counters. 
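
For illustration only, and without limiting the claims, the counter-management steps recited above might be sketched as follows. The function names `touch` and `age_counters`, the 2-bit counter width, the saturating increment, and the dictionary that maps object names to counters are assumptions of this sketch. The default threshold of one half of the maximum sum follows claims 4 and 12.

```python
def touch(counters, name, bits=2):
    """Increment the counter of an accessed object, saturating at the maximum.

    The counter width `bits` and the saturation are assumptions of this sketch.
    """
    max_per_counter = (1 << bits) - 1
    counters[name] = min(counters.get(name, 0) + 1, max_per_counter)


def age_counters(counters, bits=2, fraction=0.5):
    """Decrement every counter when their sum exceeds the threshold.

    Returns the names whose counters were decremented to zero. Those entries
    are removed from `counters`, standing in for deletion of the objects.
    """
    max_per_counter = (1 << bits) - 1            # maximum value of one counter
    total = sum(counters.values())               # sum of the stored values
    max_sum = max_per_counter * len(counters)    # sum of the maximum values
    threshold = fraction * max_sum               # threshold from the maximum sum
    evicted = []
    if total > threshold:
        for name, value in list(counters.items()):
            if value == 0:
                continue                         # nothing left to decrement
            counters[name] = value - 1
            if counters[name] == 0:
                evicted.append(name)
                del counters[name]
    return evicted
```

With three 2-bit counters holding 3, 1, and 3, the sum 7 exceeds the threshold 4.5 (half of the maximum sum 9), so one pass decrements them to 2, 0, and 2 and reports the second object for deletion.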